SECTION 1


INTRODUCTION 


Several methods for modeling and analysis of parallel algorithms and
architectures have been proposed in recent years. These include
recursion-type methods, like recursion equations, z-transform descriptions
and 'do-loops' in high-level programming languages, and
precedence-graph-type methods like data-flow graphs (marked graphs) and
related Petri-net derived models [1], [2]. Most efforts have recently been
directed towards developing methodologies for structured parallel algorithms
and architectures and, in particular, for systolic-array-like systems
[3]-[10]. Some important properties of parallel algorithms have been
identified in the process of this research effort. These include
executability (the absence of deadlocks), pipelinability, regularity of
structure, locality of interconnections, and dimensionality. The research
has also demonstrated the feasibility of multirate systolic arrays with
different rates of data propagation along different directions in the array.

In this final report we present a new methodology for modeling and
analysis of parallel algorithms and architectures. Our methodology provides
a unified conceptual framework, which we call modular computing network,
that clearly displays the key properties of parallel systems. In
particular,

(1)  Executability  of  algorithms  is  easily  verified. 

(2)  Schedules  of  execution  are  easily  determined.  This  allows  for 
simple  evaluation  of  throughput  rates  and  execution  delays. 

(3)  Both  synchronous  and  asynchronous  (self-timed)  modes  of  execution 
can  be  handled  with  the  same  techniques. 

(4)  Algorithms  are  directly  mappable  into  architectures.  No  elaborate 
hardware  compilation  is  required. 

(5) The description of a parallel algorithm is independent of its
implementation. All possible choices of hardware implementation
are evident from the description of a given algorithm. The
equivalence of existing implementations can be readily
demonstrated.

(6)  Both  regular  and  irregular  algorithms  can  be  modeled.  Models  of 
regular  algorithms  are  significantly  simpler  to  analyse,  since 
they  inherit  the  regularity  of  the  underlying  problem. 

Our  methodology  is  largely  based  upon  the  theory  of  directed  graphs  and  can, 
therefore,  be  expressed  both  informally,  in  pictorial  fashion,  and  formally, 
in  the  language  of  precedence  relations  and  composition  of  functions.  This 
duality  will,  hopefully,  help  to  bridge  the  gap  between  the  two  schools  of 
research  in  this  field.  An  outline  of  a  formal  language  representation  for 
modular  computing  networks  is  also  provided. 

The multiplicity of possible hardware implementations for a given
computational scheme is efficiently displayed by a space-time
representation, a notational tool that has been incorporated into some
recent methodologies for modeling, analysis and design of parallel
architectures [23-31]. Coordinate transformations of a given space-time
representation produce distinct hardware configurations which are equivalent
in the sense of being implementations of the same computational scheme.
The problem of mapping a given algorithm into a desired hardware
configuration can, therefore, be partly reduced to choosing the appropriate
coordinate transformation in space-time. In particular, uniform recurrence
relations, which correspond to systolic-array architectures, are described
by regular space-time representations. This implies that only linear
coordinate transformations are required, and that the entire computational
scheme can be described by a small collection of vectors in space-time, the
dependence vectors [25,27,28,30]. Consequently, the selection of a desired
hardware architecture for a given algorithm reduces to the determination of
an appropriate nonsingular matrix with integer entries.

A simple technique for transforming a given 3-dimensional space-time
representation into an equivalent canonical form is presented in Sections
5-6. A catalogue of canonical forms is constructed, showing a total of 34
distinct systolic architectures. The task of selecting an appropriate
transformation for a given space-time representation reduces, therefore, to
the determination of the equivalent canonical form. The important result,
which has been overlooked in previous research, is that the canonical
equivalent of any given space-time representation is unique. This means
that once a space-time representation has been specified there is no
flexibility left in the process of mapping it into systolic-array
architectures.

A small fraction of space-time representations does allow some
flexibility in selecting the hardware architecture, but only at the cost of
inefficient implementation. The well-known example of matrix
multiplication, which has four distinct realizations (see [4,5,7,10]), turns
out to be one of the few cases where such flexibility is available. A
closer examination of the structure of the matrices to be multiplied reveals
that each realization is efficient under a different set of structural
assumptions (see Section 6.3). Thus, in summary, carefully specified
algorithms lead to unique space-time representations which, in turn, lead to
essentially unique architectures.


SECTION  2 


MODELING  PARALLEL  ALGORITHMS  AND  ARCHITECTURES 


The concepts of 'algorithm' and 'architecture,' which have been widely
used for several decades, still seem to defy a formal definition. Books on
computation and algorithms either take these concepts for granted or provide
a sketchy definition using such broad terms as 'precise prescription,'
'computing agent,' 'well-understood instructions,' 'finite effort' and so
forth. The purpose of this section is to provide a simple formal model for
modeling and analysis of (parallel) algorithms and architectures. This
model, which we call modular computing network (MCN), exhibits all the
properties usually attributed both to algorithms and to hardware
architectures. As a first step toward the formal introduction of this model
we extract in Section 2.1 the main attributes of algorithms from their
characterizations in the literature. This analysis of the literature leads to
the conclusion that algorithms can only be defined in a hierarchical manner,
i.e., as well-formed compositions of simpler algorithms, and that the
simplest (non-decomposable) algorithms cannot and need not be defined. The
building blocks of the theory of algorithms are characterized in terms of
three attributes: function (what building blocks do), execution time (how
long they take to do it), and complexity (what it costs to use them). These
observations are incorporated into the modular computing network model, as
described in Sections 2.2 - 2.6.


2.1  TOWARD  A  FORMAL  DEFINITION  OF  ALGORITHMS  AND  ARCHITECTURES 

In this section we attempt to extract the main attributes of algorithms
and architectures from a randomly chosen sample of 'definitions.' Most
characterizations of algorithms are geared to the notion of sequential
execution. Nevertheless, we shall see that this underlying assumption is
almost never made explicit. As a result, the attributes of parallel
algorithms are, in fact, included in the available characterizations.

As a typical example consider the following definition. 'The term
'algorithm' in mathematics is taken to mean a computational process, carried
out according to a precise prescription and leading from given objects,
which may be permitted to vary, to a sought-for result' [11]. This
definition simply states that an algorithm is a well-defined input-output
map and that its domain contains at least one element, and usually more than
one. However, the term 'computational process' hints that an algorithm is
more than just a well-defined function. Indeed, 'A function is simply a
relationship between the members of one set and those of another. An
algorithm, on the other hand, is a procedure for evaluating a function'
[12].

But how are functions evaluated? We are told that 'this evaluation is
to be carried out by some sort of computing agent, which may be human,
mechanical, electronic, or whatever' [12]. Thus, the emphasis is on
physical realizability (the existence of a 'computing agent') but not on the
actual details of the realization. The first axiom of the theory of
algorithms is, therefore:

There  exist  basic  functions  that  are  physically  realizable. 

Further efforts to define physical realizability turn out to be quite
futile. This is recognized by Aho, Hopcroft and Ullman who say, 'each
instruction of an algorithm must have a 'clear meaning' and must be
executable with a 'finite amount of effort.' Now what is clear to one
person may not be clear to another, and it is often difficult to prove
rigorously that an instruction can be carried out in a finite amount of
time' [13]. Physical realizability is a matter of technology: what is
non-realizable today may become realizable in a year or two. The theory of
algorithms has to assume the existence of realizable basic input-output maps
but need not be concerned with the details of their implementation.
Therefore, the core of any theory of algorithms is a non-empty collection of
undefined objects, which we shall call processors. These are the 'computing
agents' mentioned above, and they are assumed to have three attributes:




(i) Function (an input-output map)

(ii) Complexity measure

(iii) Execution time

A  processor  is  assumed  to  be  capable  of  evaluating  the  input-output  map  in 
the  specified  execution  time.  The  cost  of  utilizing  the  processor  is 
specified  by  its  complexity  measure.  Notice  that  the  notion  of  'effort' 
mentioned  above  is  a  combination  of  the  processor's  complexity  and  its 
execution  time. 

It is important to draw a distinction between an algorithm and its
description. An algorithm consists of processors (or basic functions),
corresponding to all the functions that need to be evaluated. For instance,
the computation of sin x via the first 100 terms of its Maclaurin series
requires 100 basic functions, one for each term of the truncated series.
The description of the same algorithm in terms of instructions requires only
one instruction, which will be repeated 100 times with varying coefficients.
Since descriptions of algorithms need to be communicated, stored and
implemented, they must be finite, i.e., contain a finite number of
instructions. The algorithm itself, on the other hand, may consist of an
infinite number of processors, and may be used to process an infinite number
of inputs into an infinite number of outputs. Such are, for instance, most
signal processing algorithms: their inputs and outputs are time series
which may, in principle, be infinitely long. The executability of these
algorithms depends upon their capability to compute any specific output with
finite time and effort, and to use only a finite number of inputs for this
purpose. This observation also sheds new light on the concept of
'termination,' which is usually overemphasized in definitions of algorithms.
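
The distinction between an algorithm and its description can be illustrated
with a short Python sketch. It is only an illustration under stated
assumptions: the function name `sin_truncated` and the choice of a term-to-term
recurrence are introduced here and do not appear in the report. The loop body
plays the role of the single repeated instruction of the description, while
each evaluated term corresponds to one basic function of the algorithm.

```python
import math

def sin_truncated(x: float, n_terms: int = 100) -> float:
    """Sum the first n_terms terms of the Maclaurin series of sin x.

    The description is finite (one instruction inside a loop); the
    algorithm it describes evaluates n_terms basic functions, one per term.
    """
    term = x      # term for k = 0
    total = x
    for k in range(1, n_terms):
        # Same instruction, varying coefficient: term_k from term_{k-1}.
        term *= -x * x / ((2 * k) * (2 * k + 1))
        total += term
    return total
```

Calling `sin_truncated(1.0)` can be compared against `math.sin(1.0)` to check
the truncated series.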

The basic functions comprising an algorithm are interdependent in the
sense that the outputs of one processor may serve as inputs to other
processors. A complete characterization of an algorithm therefore requires
specifying both its basic operations and the interconnections between these
operations. The same statement applies, of course, to block-diagram
representations of hardware, to flow-graphs and, in fact, to any
network-type schematic. While algorithms are commonly described in some
formal language, they can also be described in a schematic manner.
Conversely, schematic hardware descriptions can be transformed into formal
language representations. To emphasize this equivalence we shall introduce
the concept of a modular computing network (MCN), which exhibits the common
attributes of both algorithms and architectures. Thus, an MCN is a pair

    $M = \{\mathcal{F}, \mathcal{G}\}$

where $\mathcal{F}$, the function of the network, is essentially the
collection of basic functions discussed above, and $\mathcal{G}$, the
architecture of the network, is a directed graph describing the
interconnections between basic functions. A detailed definition is provided
in Section 2.2.

The concept of modular computing network is hierarchical by nature.
Basic functions can themselves be characterized as networks of even more
basic functions. This requires every MCN to have the three fundamental
attributes of a basic function: input-output map, complexity and execution
time. We shall show in the sequel how to uniquely associate such attributes
with modular computing networks. The theory of MCNs is, in short, the
theory of network composition (deducing the properties of a network from its
components) and network decomposition (characterizing the components and
structure of a network whose composite properties have been specified).


2.2  MODULAR  COMPUTING  NETWORKS 

A  modular  computing  network  (MCN)  is  a  system  of  interconnected 
modules.  The  structural  information  about  the  network  is  conveyed  by 
specifying  the  interconnections  between  the  modules,  most  conveniently  in 
the  form  of  a  directed  graph  (Figure  2-1).  The  functional  information  about 
the  network  is  conveyed  by  characterizing  the  information  transferred 
between  modules  and  the  processing  of  this  information  as  it  passes  through 
the  modules. 

The structural attributes of an MCN are completely specified by its
architecture, which is an ordered quadruple

    Architecture $= \{S, T, A, P\}$    (2.2)

where S, T are sets whose elements are called sources and sinks,
respectively, and A, P are relations between these sets.

The  ancestry  relation  A  specifies  the  connections  of  sources  to 
sinks.  The  elements  of  A,  which  are  called  arcs,  are  ordered  source-sink 
pairs 

    $a \in A \implies a = (s,t),\quad s \in S,\ t \in T$    (2.3)

An  arc  represents  a  direct  transfer  of  information  from  source  to  sink.  Two 
basic  assumptions  govern  this  transfer: 

(1)  There  are  no  dangling  sources.  Every  source  is  connected  to 
exactly  one  sink. 

(2) There are no dangling sinks. Every sink is connected to exactly
one source.

These assumptions mean that the three sets S, T, A have an equal number of
elements, and that the ancestry relation A establishes a one-to-one
correspondence between arcs, sources and sinks, viz.,

    $(s,t) \in A \iff s = A(t) \iff t = A^{-1}(s)$    (2.4)

This one-to-one correspondence will permit us to identify in the sequel each
arc with its associated source and sink, and to eliminate almost all sinks
and sources from the description of network architectures.
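
The two assumptions on the ancestry relation can be checked mechanically. The
following Python sketch is illustrative only: the function `is_one_to_one`,
the string labels, and the dictionary `ancestor` are hypothetical names
introduced here, with arcs stored as (source, sink) pairs as in (2.3).

```python
from collections import Counter

def is_one_to_one(sources, sinks, arcs):
    """Return True if every source and every sink lies on exactly one arc."""
    src_count = Counter(s for s, _ in arcs)
    snk_count = Counter(t for _, t in arcs)
    # Every arc must join a known source to a known sink.
    endpoints_known = all(s in sources and t in sinks for s, t in arcs)
    return (endpoints_known
            and all(src_count[s] == 1 for s in sources)   # no dangling sources
            and all(snk_count[t] == 1 for t in sinks))    # no dangling sinks

# A hypothetical two-arc architecture.
S = {"s1", "s2"}
T = {"t1", "t2"}
A = {("s1", "t2"), ("s2", "t1")}

# When the check passes, each sink determines its source: s = A(t) as in (2.4).
ancestor = {t: s for s, t in A}
```

When the check succeeds, the sets S, T and A necessarily have the same number
of elements, which is the property used to identify each arc with its source
and sink.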

The processing relation P specifies the processing of information
extracted from sinks into transformed information, which is re-injected into
sources. The elements of P, which are called processors, are ordered
pairs of non-empty finite sink-source sequences, viz.,

    $p \in P \implies p = (t_1, t_2, \ldots, t_m;\ s_1, s_2, \ldots, s_n)$    (2.5)

    $t_i \in T,\quad s_i \in S,\quad 1 \le m, n < \infty$

The input set $(t_1, t_2, \ldots, t_m)$ consists of all the sinks from which
the processor p extracts information. The transformed information is
distributed among the members of the output set $(s_1, s_2, \ldots, s_n)$.
The one-to-one correspondence between sources, sinks and arcs allows us to
describe processor inputs and outputs in terms of arcs and to almost
completely eliminate the notion of sources and sinks. The set of input arcs
of a processor p is denoted by $A_i(p)$, and the set of output arcs from
the same processor is denoted by $A_o(p)$. Each processor is assumed to have
unique inputs and outputs, namely

    $p, q \in P,\ p \ne q \implies A_i(p) \cap A_i(q) = \emptyset,\quad A_o(p) \cap A_o(q) = \emptyset$    (2.6)

Similarly, every collection of processors, $Q \subseteq P$, has its uniquely
defined inputs and outputs, viz.,

    $A_i(Q) := \bigcup_{p \in Q} A_i(p) - \bigcup_{p \in Q} A_o(p)$    (2.7a)

and

    $A_o(Q) := \bigcup_{p \in Q} A_o(p) - \bigcup_{p \in Q} A_i(p)$    (2.7b)


In other words, the inputs of Q are those inputs of processors in Q that
are not connected to outputs of processors in Q. A similar statement holds
for outputs of Q. In particular, $A_i(P)$, $A_o(P)$ are the inputs and
outputs of the entire network.
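
The inputs and outputs of a collection of processors follow directly from set
operations. The Python sketch below is a minimal illustration, assuming each
processor is stored as a pair (set of input arcs, set of output arcs); the
names `procs`, `network_inputs` and `network_outputs` are hypothetical and
chosen here only for the example.

```python
def network_inputs(procs, Q):
    """Input arcs of the processors in Q that are not outputs of Q, cf. (2.7a)."""
    ins = set().union(*(procs[p][0] for p in Q))
    outs = set().union(*(procs[p][1] for p in Q))
    return ins - outs

def network_outputs(procs, Q):
    """Output arcs of the processors in Q that are not inputs of Q, cf. (2.7b)."""
    ins = set().union(*(procs[p][0] for p in Q))
    outs = set().union(*(procs[p][1] for p in Q))
    return outs - ins

# A hypothetical chain of two processors: arc "b" is internal to {p1, p2}.
procs = {
    "p1": ({"a"}, {"b"}),
    "p2": ({"b"}, {"c"}),
}
```

For the chain above, the pair {p1, p2} has the single input arc "a" and the
single output arc "c", while p2 alone has input "b".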

Network architectures are most conveniently described by a directed
graph that combines the ancestry relation A and the processing
relation P into a single block-diagram-like representation (Figure 2-2a).
Sources and sinks are denoted by semi-circles, processors by circles and
arcs are, obviously, denoted by arcs. Sources and sinks are paired, and
each processor has its inputs and outputs adjacent to itself. An obvious
reduction in notation (Figure 2-2b) enhances the comprehensibility of the
description. The reduced form is, essentially, a block-diagram
representation of the network architecture, and can be interpreted as a
directed graph

    $\mathcal{G} = (V, A)$    (2.8a)

The set of vertices V of this graph is

    $V = \{A_i(P),\ P,\ A_o(P)\}$    (2.8b)

where $A_i(P)$ are interpreted as the sources corresponding to the input arcs
and $A_o(P)$ are interpreted as the sinks corresponding to the output arcs.
The arcs of the directed graph coincide with the original set of arcs A.

The interpretation of network architectures as directed graphs puts at our
disposal the powerful tools and results of graph theory. Some of these will
be used in the sequel to characterize and analyze the structure of modular
computing networks.

The functional attributes of an MCN are completely determined by its
architecture and by specifying the functional attributes of each processor.
Thus, the function of a network is an ordered pair

    $\mathcal{F} = \{X, F\}$    (2.9)

where X, F are sets whose elements are called variables and maps,
respectively.

The elements of X are sets (i.e., domains) and 'assigning a value to
a variable' amounts to choosing a particular element in the domain
corresponding to that variable. There is one variable, $x_a$, associated
with every arc $a \in A$ of the corresponding architecture. Consequently,
there is a one-to-one correspondence between variables, sources, sinks and
arcs. This correspondence makes it possible to refer to the variables
associated with the inputs of a given processor p as the input variables
of p and denote them by $X_i(p)$. A similar notation, $X_o(p)$, is used for
the variables associated with the outputs of the processor p.

The elements of F are multivariable maps. There is one map, $f_p$,
associated with every processor $p \in P$ of the corresponding architecture.
It maps the input variables of this processor into the corresponding output
variables, viz.,

    $f_p : X_i(p) \to X_o(p)$    (2.10)

which means that each of the output variables is a function of the input
variables (not necessarily of all the input variables). This establishes a
precedence relation between the inputs and outputs of a given processor,
viz.,

    $x \to y$    (2.11)

if $x \in A_i(p)$, $y \in A_o(p)$ and if y is a function of x (and, possibly,
of other input variables). The transitive closure of this relation is also
a precedence (i.e., a partial order): we shall say that $x_1$ precedes $x_n$
if there exists a sequence of variables $x_1, x_2, \ldots, x_n$ such that

    $x_1 \to x_2 \to \cdots \to x_n$

in the sense of (2.11). This global precedence will also be denoted by
$x_1 \to x_n$. The ancestry [14] of a variable $x \in X$ is the set of all
variables that precede x, viz.,

    $a(x) := \{z;\ z \in X,\ z \to x\}$    (2.12)

These are all the variables that have to be known in order to determine the
value of x.
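
The ancestry of a variable can be computed by following the precedence
relation backwards. The sketch below is a minimal illustration for finite
networks; the dictionary `direct_pred`, which maps each variable to the set of
variables that directly precede it in the sense of (2.11), and the function
name `ancestry` are assumptions made here for the example.

```python
def ancestry(direct_pred, x):
    """Return the set of all variables that precede x (transitive closure)."""
    seen = set()
    stack = list(direct_pred.get(x, ()))
    while stack:
        z = stack.pop()
        if z not in seen:
            seen.add(z)
            # Predecessors of a predecessor also precede x.
            stack.extend(direct_pred.get(z, ()))
    return seen

# Hypothetical network: c depends on a and b, and d depends on c.
direct_pred = {"c": {"a", "b"}, "d": {"c"}}
```

Here the ancestry of "d" consists of "a", "b" and "c", while the network input
"a" has an empty ancestry.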

Since the function of a network consisting of a single processor p is

    $\mathcal{F}_p = \{X_i(p) \cup X_o(p),\ f_p\}$

there is, essentially, no distinction between the function and the map of
p. Thus, the input-output map of a single processor may also be called the
function of the processor. The same is not true for a network consisting of
several processors: the input-output map of a network is a single
multivariable map, relating the outputs of the network to its inputs; the
function of the network, in contradistinction, is the collection of the
atomic maps that comprise the network. The analysis problem for
computational networks is to determine the network map from its function.
The synthesis problem is to design an MCN (i.e., specify its structure and
function) that realizes a given multivariable input-output map.

Modular computing networks need not be finite. In fact, most signal
processing algorithms correspond to infinite MCNs. However, the concept of
finite effort, involved in the evaluation of variables, imposes certain
constraints upon infinite networks. First, the number of inputs and outputs
of every processor must be finite. This means that the graph $\mathcal{G}$
describing the architecture is locally finite [15]. Next, every variable
must be computable with finite effort, so it will be required to have a
finite ancestry, viz.,

    $|a(x)| < \infty$ for all $x \in X$    (2.13)

We shall also assume that the number of connected components of the
architecture $\mathcal{G}$ is countable. A modular computing network that
satisfies the three assumptions stated above (local finiteness, finite
ancestry and a countable number of connected components) will be called
structurally finite. The following result characterizes the kind of infinity
allowed in such networks.


Theorem 2.1

A structurally finite MCN has a countable number of variables and
processors. The following three statements are equivalent:

(1) The number of variables is finite.

(2) The number of output variables is finite.

(3) The number of processors is finite.

Proof:

The countability of the variables and processors of a connected network
is a direct consequence of local finiteness (see, e.g., [15]). Since each
connected  component  has  a  countable  number  of  variables  and  processors,  the 
same  is  obviously  true  for  a  countable  number  of  connected  components.  Thus 
the  number  of  variables  and  processors  of  a  structurally  finite  MCN  must  be 
countable.  As  a  consequence  of  local  finiteness,  a  finite  number  of 
processors  implies  a  finite  number  of  variables  and  vice  versa,  so  (1)  and 
(3)  are  equivalent.  Clearly  (1)  implies  (2),  while  (2),  via  the  finite 
ancestry  condition,  implies  (1). 


2.3  CAUSALITY  AND  EXECUTIONS 


The definition of processors in the previous section did not take into
account any constraints imposed by hardware implementation considerations.
The most important among these constraints is the causality property. It
will be henceforth assumed that an output of a processor cannot become
available before the inputs of the same processor that precede this output
become available. In the beginning all variables are unavailable; the
inputs of the network are made available at a given instant, and following
that event, all variables of the network gradually become available. This
temporally ordered process, which we shall call execution, must be
consistent with the precedence relation between variables induced by the
directed nature of the architecture $\mathcal{G}$. A network that possesses
an execution in which every variable ultimately becomes available is said to
be executable (or 'live' in the terminology of Petri-nets [1]). It is clear
that a network containing a cycle cannot be executable since no variable
(i.e., arc) on the cycle can ever become available. In order to satisfy the
causality assumption every variable in the cycle must temporally precede
itself (i.e., it must be available before it becomes available (Fig. 2-3)),
which is, clearly, impossible. It turns out that every acyclic architecture
is executable. To prove this result we shall need to formalize the notion
of execution.

Figure 2-3


An execution of an MCN is a partitioning of its variables into a
sequence of finite disjoint sets, viz.,

    $E = \{S_i;\ 0 \le i < \infty,\ |S_i| < \infty,\ S_i \cap S_j = \emptyset \text{ for } i \ne j,\ \bigcup_j S_j = X\}$    (2.14a)

such that the precedence relation is preserved, viz.,

    $a(S_i) \subseteq \bigcup_{j=0}^{i-1} S_j, \quad i = 0, 1, \ldots$    (2.14b)

Here a(S) denotes the ancestry of the set S, defined as the collection
of all ancestors of members of S, viz.,

    $a(S) := \bigcup_{x \in S} a(x)$    (2.15)

In simple words, every ancestor of $x \in S_i$ must be contained in one of
the sets $S_0, S_1, \ldots, S_{i-1}$, which we shall call levels. Executions
can be interpreted as multistep procedures for evaluating all the variables
in X. The members of the level $S_i$ are evaluated at the i-th step, and the
condition (2.14b) guarantees the availability of all their ancestors at the
right moment. Since the ancestors of the level $S_i$ strictly precede $S_i$,
all variables in this set can be evaluated simultaneously, giving rise to a
parallel execution. If each set contains exactly one variable the
execution will be called sequential.

Since each level $S_i$ in an execution is finite, the evaluation of the
variables in $S_i$ from the members of the preceding levels requires finite
effort. Since each variable belongs to some level $S_i$, the total effort
involved in the evaluation of a single variable from the global inputs is
also finite. Thus, the existence of an execution for a given MCN implies
that every variable can be evaluated with finite time and hardware. A
network that has an execution deserves, therefore, to be called executable.

The  preceding  discussion  implies  that  executability  is  a  structural 
property,  since  only  the  precedence  relation  between  variables  is  involved 
in  constructing  executions.  The  following  result  presents  a  simple 
structural test for executability of MCNs.


Theorem 2.2

A structurally finite MCN is executable if, and only if, its
architecture is acyclic.

Proof:

If an execution exists, then it can be easily converted into a
sequential execution by ordering the variables in each (finite) level $S_i$
in some arbitrary manner. Thus, executability is equivalent to the
existence of a sequential execution. By a well-known result in the theory
of finite directed graphs, a sequential ordering exists if, and only if, the
graph is acyclic. Thus, the theorem holds for finite MCNs. The proof for
infinite networks is given in Appendix A.
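
For finite networks, the executability test and the construction of an
execution can be sketched together. The Python code below is an illustration
only, not the report's procedure: it uses a standard level-by-level
topological ordering, and the names `execution_levels` and `direct_pred` are
hypothetical. It assumes `direct_pred` maps each variable to the set of
variables that directly precede it.

```python
def execution_levels(direct_pred, variables):
    """Partition variables into levels S_0, S_1, ...; return None on a cycle."""
    remaining = set(variables)
    done = set()
    levels = []
    while remaining:
        # A variable is ready once all of its direct predecessors are available.
        ready = {x for x in remaining if direct_pred.get(x, set()) <= done}
        if not ready:
            return None          # every remaining variable waits on a cycle
        levels.append(ready)
        done |= ready
        remaining -= ready
    return levels

# Hypothetical acyclic network.
direct_pred = {"c": {"a", "b"}, "d": {"c"}, "e": {"a"}}
levels = execution_levels(direct_pred, "abcde")

# Ordering each level arbitrarily yields a sequential execution.
sequential = [x for level in levels for x in sorted(level)]
```

Each level contains only variables whose predecessors lie in earlier levels,
so its members may be evaluated in parallel. A network with a cycle produces
no levels at all, in agreement with the acyclicity condition.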


Corollary

Executable MCNs always have sequential executions.


The corollary confirms the intuitive notion of executability: any
computation that can be carried out in parallel can also be carried out
sequentially. Parallel execution offers, however, an attractive trade-off
between hardware and time, which will be discussed in detail in Sec. 3.4.

Theorem 2.2 provides a simple test for executability and, in effect,
prevents the construction of non-executable MCNs. Thus, the pitfalls of
starvation and deadlocks, well known in the context of Petri-nets [1], are
easy to avoid. Notice also that since each variable in an MCN is evaluated
exactly once, safeness [1] is guaranteed. This means that inputs to
processors do not disappear before they have been used to evaluate the
subsequent outputs. Safeness is achieved because once a variable becomes
available it stays so forever, and never disappears.


2.4  HIERARCHICAL  COMPOSITION  OF  MCNs 

Modular  computing  networks  are,  by  definition,  constructed  in  a 
hierarchical manner. A processor p in an MCN can itself be a network,
provided it has a well-defined input-output map $f_p$. In this section we
analyze  the  constraints  that  have  to  be  imposed  upon  MCN  composition  in 
order  to  guarantee  the  existence  of  a  well-defined  global  input-output  map. 

From  the  structural  point  of  view  a  composition  is  simply  a  network  of 
networks.  The  'processors'  of  the  composite  network  are  MCNs  and  the  arcs 
represent  interconnections  between  outputs  of  MCNs  to  inputs  of  other  MCNs. 
The  architecture  of  the  composition,  obtained  by  regarding  each  MCN 
component  as  a  simple  'processor'  has  to  satisfy  the  constraints  of  Sec. 
2.2.  An  architecture  is  called  admissible  if  it  satisfies  the  three 
following  constraints: 

(1)  No  dangling  inputs  and  outputs 

(2)  No  cycles 

(3) It is structurally finite

The importance of these constraints lies in the fact that an admissible
composition of admissible architectures is itself an admissible architecture
(see Appendix B for proof). It is interesting to notice that the
admissibility conditions are also instrumental in establishing other
important properties of architectures. In particular, an admissible
composition of self-timed elements is itself a self-timed element [6], [7].

To establish the hierarchical nature of composition it is only
necessary to verify that an admissible composition of processors with a
well-defined input-output map also has a well-defined input-output map.
This will be done by interpreting executions as decompositions of MCNs into
elementary parallel and sequential combinations.

Parallel composition of two architectures, $\mathcal{G}_1$ and $\mathcal{G}_2$, is defined as
the union of the two networks without any interconnections between $\mathcal{G}_1$ and
$\mathcal{G}_2$ (Fig. 2-4a). Sequential composition involves the connection of every
output of $\mathcal{G}_1$ to a corresponding input of $\mathcal{G}_2$; thus the number of outputs
of $\mathcal{G}_1$ must equal the number of inputs of $\mathcal{G}_2$ (Fig. 2-4b). We shall
denote parallel composition by $\mathcal{G}_1 \,\#\, \mathcal{G}_2$ and sequential composition by
$\mathcal{G}_1 \cdot \mathcal{G}_2$. The parallel composition of a countable number of admissible
networks is always admissible. The sequential composition of a sequence of
admissible networks is admissible too, i.e.,

$$ \mathcal{G}_1 \cdot \mathcal{G}_2 \cdot \mathcal{G}_3 \cdots $$

is admissible because the unilateral nature of the cascade preserves the
finite ancestry property, while local finiteness and countability of
components are clearly preserved.


Executions define a rearrangement of MCNs as a sequential composition
of subnetworks, each subnetwork being a parallel composition of processors.
The MCN of Figure 2-1 can, for instance, be described as

$$ (f_1 \,\#\, e \,\#\, e) \cdot (e \,\#\, f_2 \,\#\, e) \cdot (f_4 \,\#\, e \,\#\, f_3) \cdot (e \,\#\, f_5 \,\#\, e) \cdot (f_6 \,\#\, e \,\#\, e) $$

where e is an identity input-output map. The importance of this
observation lies in the fact that the input-output map of any
sequential-parallel composition is well-defined. Consequently, every execution has a
well-defined input-output map. This leads to the following result.
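
The way a sequential-parallel composition inherits a well-defined
input-output map from its components can be illustrated with a small Python
sketch. Everything named here is an assumption made for illustration: a map
is encoded as a pair (number of inputs, function returning a tuple), and the
helpers `parallel` and `sequential` play the roles of the # and
sequential-composition operators. The sample processor `AM` (add/multiply)
and the identity `e` are hypothetical and do not reproduce the network of
Figure 2-1.

```python
def parallel(*maps):
    """Parallel composition: the maps act side by side on disjoint inputs."""
    def combined(*args):
        outputs, position = [], 0
        for n_inputs, f in maps:
            outputs.extend(f(*args[position:position + n_inputs]))
            position += n_inputs
        return tuple(outputs)
    return (sum(n for n, _ in maps), combined)


def sequential(*maps):
    """Sequential composition: outputs of one stage feed the next stage."""
    def combined(*args):
        for _, f in maps:
            args = f(*args)
        return tuple(args)
    return (maps[0][0], combined)


# Hypothetical components: an add/multiply processor and the identity map.
AM = (2, lambda x, y: (x + y, x * y))
e = (1, lambda x: (x,))

# (AM # e) followed by (e # AM): a two-layer execution with three inputs.
n_inputs, network = sequential(parallel(AM, e), parallel(e, AM))
result = network(2, 3, 4)  # first layer gives (5, 6, 4); second gives (5, 10, 24)
```

Because each layer is an ordinary function of the previous layer's outputs,
the composite is again an ordinary function of the network inputs.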


Every executable MCN has a unique well-defined input-output map.


Proof:


See  Appendix  C. 


The  theorem  establishes  the  utility  of  the  notion  of  execution.  While 
each  execution  corresponds  to  a  different  ordering  of  the  computations 
required  to  evaluate  the  output  variables  of  an  MCN,  all  executions 
determine  the  same  input-output  map.  And,  while  each  execution  provides  a 
different  description  of  the  network,  they  all  correspond  to  the  same  MCN. 

Descriptions of computational schemes will be considered equivalent if
they determine the same input-output map. They will be considered
structurally equivalent if, in addition, they determine the same MCN.
Structural equivalence, which amounts to different choices of executions,
leaves both the architecture and the function of the MCN unchanged. Other
types of equivalence transformations will affect both the architecture and
the function of the MCN but will keep its input-output map unchanged.


2.5 COMPARISON OF MCNs WITH OTHER NETWORK MODELS


2.5.1  Block-Diagrams  and  Finite-State  Machines 

Numerical  algorithms  are  most  frequently  described  in  terms  of 
recursion  equations  involving  indexed  quantities,  known  as  signals.  Z- 
transform  notation  and  block  diagrams  (or  signal-flow-graphs)  are  sometimes 
used  as  equivalent  descriptions  of  recursion  equations. 

The  main  difference  between  MCNs  and  Z-transform  block-diagrams  is  in 
the  distinguished  role  of  time  in  the  latter  model.  A  cascade  connection  of 
three  blocks,  each  with  its  own  state  (Fig.  2-5a)  corresponds  to  an  MCN  of 
infinite  length  (Fig.  2-5b).  Each  row  of  the  MCN  represents  a  single  step 
of the recursion. Each input/output is a single variable, not a
time-series. While the MCN description seems wasteful, it does in fact enhance
our  understanding  of  the  various  possibilities  of  implementation.  Moreover, 
MCNs  can  describe  irregular  algorithms  that  cannot  be  described  in  terms  of 
recurrence  equations.  This  means  that  every  block  diagram  can  be  converted 
into  an  MCN  but  not  vice  versa.  The  conversion  amounts  to  duplicating  the 
block  diagram  several  times  (once  for  every  iteration  of  the  recursion)  and 
converting  delay  elements  into  direct  connections  between  consecutive 
duplicates,  as  in  Figure  2-5. 

The preceding discussion considered only block-diagrams that correspond
to sets of recursion equations. Such diagrams always consist of delay
elements and memoryless operations. This means, of course, that only
block-diagrams whose blocks represent finite-state machines can be converted in a
straightforward manner into an MCN. Any other block-diagram has to be first
converted into a state-space form (i.e., every block has to be represented
by a state-space model or a combination of such models) before it can be
converted into an MCN. Thus, in particular, any signal-flow-graph with
rational transfer functions can be transformed into an MCN.

The correspondence between block-diagrams and MCNs provides a simple
test for the executability (= computability) of algorithms represented by
block diagrams.


A finite block-diagram (or signal-flow-graph) whose blocks are
characterized by delay elements and memoryless maps is executable if, and
only if, the directed graph obtained by deleting delay elements from the
diagram (or equivalently, by setting $z^{-1} = 0$ in the transfer functions) is
acyclic.


Proof:


Since delay elements are causal, they can never give rise to cycles in
the corresponding MCN. In other words, since all operations in the i-th
iteration temporally precede all operations in the (i+1)-th iteration, the
only cycles the MCN representation of a block-diagram may have must be
contained within a single layer, corresponding to a single iteration. A
single layer of the MCN is obtained by removing all delay elements from the
block-diagram.


The test not only establishes the executability of a given
block-diagram but indicates how to transform non-executable networks into
executable ones. Consider, for instance, the network in Figure 2-6a. It is
non-executable if $H(\infty) \neq 0$, because a cycle exists in the network for
$z = \infty$. However, the same transfer function can be realized by the network
in Figure 2-6b, which is executable.
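
The delay-deletion test can be sketched as follows. This is a minimal
illustration under stated assumptions: a block diagram is encoded as a list
of block names plus (source, destination, has_delay) connections, and the
function name `is_executable` is hypothetical. Connections that pass through
a delay element are dropped, and the remaining directed graph is checked for
cycles by depth-first search.

```python
def is_executable(blocks, connections):
    """Return True if the diagram has no delay-free cycle.

    blocks: list of block names (hypothetical encoding)
    connections: (source, destination, has_delay) triples
    """
    blocks = list(blocks)
    # Keep only the delay-free connections, i.e., delete the delay elements.
    graph = {b: [] for b in blocks}
    for src, dst, has_delay in connections:
        if not has_delay:
            graph[src].append(dst)

    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {b: UNVISITED for b in blocks}

    def has_cycle(b):
        state[b] = IN_PROGRESS
        for nxt in graph[b]:
            if state[nxt] == IN_PROGRESS:
                return True  # back edge: a delay-free cycle exists
            if state[nxt] == UNVISITED and has_cycle(nxt):
                return True
        state[b] = DONE
        return False

    return not any(state[b] == UNVISITED and has_cycle(b) for b in blocks)
```

For a feedback loop between two blocks, the diagram is reported as
non-executable when neither branch contains a delay, and as executable as
soon as one branch of the loop passes through a delay element.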


Figure 2-6. Transformation of a Non-Executable Network into an Equivalent
Executable One (a. Non-Executable Network; b. Equivalent Executable Network)


2.5.2 Data-Flow-Graphs and Petri-Nets


The MCN is, clearly, a data-flow-graph [18] with the additional
constraint that only one token is placed at every input of the network, and
consequently, only one token eventually appears at every output of every
processor. Thus, an MCN is safe by definition. In spite of this
observation, every data-flow-graph (safe or unsafe) can be converted into an
MCN, as long as every firing of a vertex in the flow-graph removes one token
from every input line and adds one token to every output line. This
constraint implies that the data-flow-graph can be converted into a block
diagram involving only delay elements, advance elements and memoryless maps.
This block-diagram can in turn be converted into a (not necessarily
executable) MCN. The executability condition, when transformed back to the
data-flow-graph domain, becomes a cycle sum test, as described in [19].

Petri-nets are more general than data-flow-graphs. They allow two
different kinds of vertices, known as places and conditions. Conditions
correspond to our concept of processors, while places are combinations of
multiple sources and sinks and thus have no counterpart in the MCN model.
Petri-nets whose places have at most one input and at most one output are,
in fact, data-flow-graphs (also known as marked graphs [20]) and can be
converted into MCNs.

Most high-level-language computer programs can be converted with little
difficulty into MCNs. Each assignment statement of the program becomes a
processor in the corresponding MCN. Program variables are mapped into
network variables according to the following rules:

(i) Each program variable, say x, is mapped into several network
variables, denoted by $x_1$, $x_2$, etc.

(ii) An occurrence of a program variable x in the right-hand-side
of an assignment statement is mapped into the same network
variable $x_j$ as the preceding occurrence of the same variable
in the program.

(iii) An occurrence of a program variable x in the left-hand-side of
an assignment statement is mapped into a new network variable,
i.e., into $x_{j+1}$ if the most recent occurrence was mapped into
$x_j$.

Recursions (do-loops) are mapped into sequential compositions of identical
processors, each processor corresponding to one step of the recursion. The
mapping of conditional recursions ('if' and 'while' statements) is somewhat
more complicated and will not be described here. A separate technical memo
will be devoted to the details of converting computer programs and other
descriptions into MCNs, and vice versa.
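
Rules (i)-(iii) amount to a single-assignment renaming of straight-line
code, which can be sketched in Python as follows. The encoding is an
assumption made for illustration: each assignment statement is a pair
(target, list of operand names), the function name `to_single_assignment`
is hypothetical, and a variable that is read before it is ever written is
treated as a network input and given index 1.

```python
def to_single_assignment(statements):
    """Rename program variables so that each network variable is assigned once.

    statements: list of (target, operands) pairs in program order
    returns: list of (renamed_target, renamed_operands) pairs
    """
    version = {}  # most recent index j used for each program variable
    renamed = []
    for target, operands in statements:
        new_operands = []
        for var in operands:
            # Rule (ii): reuse the network variable of the preceding occurrence.
            version.setdefault(var, 1)
            new_operands.append(f"{var}{version[var]}")
        # Rule (iii): a left-hand-side occurrence creates a new network variable.
        version[target] = version.get(target, 0) + 1
        renamed.append((f"{target}{version[target]}", new_operands))
    return renamed


program = [("x", ["a", "b"]), ("x", ["x", "a"]), ("y", ["x", "b"])]
network = to_single_assignment(program)
```

Each renamed statement then corresponds to one processor, and each renamed
variable to one arc of the resulting network.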

The  conversion  of  an  MCN  into  a  computer  program  is  straightforward: 
Each  processor  is  mapped  into  several  assignment  statements,  and  each 
network  variable  is  mapped  into  a  program  variable. 

2.5.4 Summary

The preceding analysis has shown that MCNs are essentially equivalent
to computer programs, to block diagrams involving finite-state blocks, and
to a subclass of Petri-nets (marked graphs). The major distinction between
MCNs and most other representations is the embedding of the notion of
executability into the MCN model itself. Thus, the only way to design
non-executable MCNs is by the introduction of cycles in the network
architecture.  Moreover,  the  test  for  executability  is  very  easy  to  carry 
out  and  can  be  included  in  any  compiler  for  MCN  representations.  It  is  much 
easier,  on  the  other  hand,  to  design  malfunctioning  Petri-nets  or  computer 
programs,  and  much  more  difficult  to  detect  the  errors  in  the  design. 

2.6  FORMAL  LANGUAGE  REPRESENTATION  OF  MCNs 

To  facilitate  the  application  of  the  MCN  model  to  both  VLSI  hardware 
design  and  software  engineering  it  is  necessary  to  develop  a  formal  language 
version  of  the  model,  which  preserves  the  convenience  and  simplicity  of  the 
graph-theoretic formulation. Such a formal language representation should
not include more information than that provided by the network graph. In
particular, it should involve no details pertaining to the implementation of
the MCN in a particular type of hardware. The matching between the
requirements of an MCN model and the resources provided by a particular
machine (e.g., sequential computer, dataflow computer, programmable
wavefront array) should be carried out by the compiler, not by the
user/programmer. This will significantly simplify the coding phase of MCN
models and eliminate most of the common programming errors.

To  achieve  the  objective  stated  above  the  language  should  be  capable  of 
describing  the  two  ingredients  of  the  MCN  model,  variables  and  processors, 
and  nothing  else.  It  has  to  be  a  single  assignment  language,  with  each 
variable  carrying  its  own  name.  Only  two  types  of  statements  will  be 
allowed:  one  for  describing  the  interconnection  between  processors,  the 
other  for  describing  the  functional  characteristics  of  each  processor. 
Regular  interconnection  patterns  will  be  described  by  indexed  loops.  The 
sequential  order  of  instructions  in  a  program  can  be  arbitrary  and  has 
nothing  to  do  with  the  order  of  execution,  which  will  be  determined  by  the 
compiler  in  accordance  with  the  precedence  relation  of  the  MCN,  as  well  as 
the  available  storage  and  computing  resources. 

The  purpose  of  imposing  such  severe  limitations  upon  the  syntax  of  the 
proposed  language  is  to  eliminate  all  flexibility  in  the  translation  of  an 
MCN model into a computer program. Decisions about the structure of the MCN
model for a given signal processing problem have to be made prior to the
coding stage. Decisions about allocation of storage to variables and
computing resources to computations have to be made after the coding stage,
and  preferably,  by  the  compiler.  This  means  that  the  coding  stage  itself 
can  also  be  automated  in  the  future,  enabling  the  user  to  specify  his 
designs  interactively  by  'drawing'  the  MCN  model  on  a  computer  terminal. 

Several languages have already been designed for modeling of parallel
algorithms/architectures. Some of these focus upon the physical aspects of
hardware implementation and almost completely lack the functional
characteristics necessary to specify the algorithm. Others focus upon
functional characteristics with little attention paid to structure. Only a
few languages, like CRYSTAL [<], MDFL [7] and SIGNAL [23] maintain the
balance between structure and function. Our approach combines ideas from
these and several other languages with some unique concepts that emerged
from the research on MCN models.

The principles underlying the construction of a formal language for MCN
models are demonstrated by the following example.


Figure  2-7.  The  MCN  'Example' 


The  corresponding  MCN  program  is 

MCN EXAMPLE (X1,Y1,Z1,W1; X2,Y2,Z2,W2)
BEGIN
    AM (X1,Y1; X2,YTEM)
    AM (Z1,W1; W2,ZTEM)
    AS (YTEM,ZTEM; Y2,Z2)
END EXAMPLE

PROC AM (X,Y; A,M)
BEGIN
    A := X+Y
    M := X*Y
END AM

PROC AS (X,Y; A,S)
BEGIN
    A := X+Y
    S := X-Y
END AS


The  unique  features  of  the  language  are: 

1)  Single  assignment  -  each  variable  has  its  own  name. 

2) Modularity - each procedure is self-contained and can be compiled
and verified independently of the other procedures.

3) Hierarchy - there are three levels of specification: (i) Networks
(MCN), which consist of other networks or of atomic processors
(PROC); (ii) Processors (PROC), which consist of assignment
statements;  and  (iii)  Variables,  which  may  be  of  the  types  commonly 
used  in  conventional  computer  languages,  and  are  used  to  construct 
assignment  statements. 

4)  Information  hiding  -  the  components  of  the  MCN  program  unit  are 
specified  only  in  terms  of  their  inputs  and  outputs,  without  any 
details  about  their  internal  structure. 

5) Modifiability and localization - the internals of every program unit
can  be  modified  without  affecting  the  correctness  of  other  units. 
The  correctness  of  the  modified  unit  can  be  tested  without 
reference  to  other  units. 

Notice  that  the  order  of  program  units  as  well  as  the  order  of  assignment 
statements  in  a  processor  is  immaterial.  This  is  made  possible  by  the 
single  assignment  convention  which  associates  one  variable  with  every  arc  of 
the  corresponding  MCN  graph.  The  names  of  processors  and  networks,  on  the 
other  hand,  can  be  duplicated  to  indicate  identical  inner  structure  (e.g., 
there  are  two  'AM'  processors  in  the  network  'EXAMPLE'). 

A formal language representation of an MCN model provides no
information about the order in which the computations implied by the model
will be executed. Following the precedence relation specified by the model,
the  computations  can  be  arranged  in  layers,  or  wavefronts.  The  computations 
belonging  to  one  layer  can  be  executed  in  an  arbitrary  order,  or  even  all  in 
parallel.  On  the  other  hand,  the  execution  of  the  (i-l)-th  layer  must 
precede  that  of  the  i-th  layer.  The  reformulation  of  the  MCN  model  in  terms 
of  layers,  which  was  introduced  in  Section  2.4,  emphasizes  the  sequential- 
parallel  nature  of  every  MCN  model,  and  serves  as  an  intermediate  step 
between  the  network-type  character  of  the  MCN  model  and  the  purely 
sequential  nature  of  conventional  computer  languages.  This  wavefront 
representation also assists in determining variables that can use the same
storage area, and in allocating physical resources to computations when the
number  of  available  processors  is  less  than  that  required  for  the  fully 
parallel  execution  of  a  given  layer.  Thus,  the  transformation  of  an  MCN 
model  into  a  layered  format  plays  a  central  role  in  the  compilation  of  MCN 
programs  for  execution  on  a  specified  machine. 
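
The arrangement of computations into layers (wavefronts) can be sketched in
Python. This is only an illustration of the idea, not the compiler procedure
of the report: the function name `wavefronts` and the encoding of the
network as processor names plus (source, destination) arcs are assumptions.
Each pass collects the processors whose predecessors all lie in earlier
layers. The sample data mirrors the network 'EXAMPLE', with its two AM
processors (labeled AM_1 and AM_2 here for distinction) feeding AS.

```python
def wavefronts(processors, arcs):
    """Partition processors into layers consistent with the precedence relation."""
    processors = list(processors)
    parents = {p: set() for p in processors}
    for src, dst in arcs:
        parents[dst].add(src)

    level = {}    # layer index assigned to each processor (1-based)
    layers = []
    remaining = processors
    while remaining:
        # A processor joins the current layer once all its parents are placed.
        layer = [p for p in remaining if all(q in level for q in parents[p])]
        if not layer:
            raise ValueError("cycle detected: the network is not executable")
        for p in layer:
            level[p] = len(layers) + 1
        layers.append(layer)
        remaining = [p for p in remaining if p not in level]
    return layers


example_layers = wavefronts(
    ["AM_1", "AM_2", "AS"],
    [("AM_1", "AS"), ("AM_2", "AS")],
)
```

The two AM processors land in the first layer and may run in either order
or in parallel; AS forms the second layer.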


2.7  SUMMARY 

A  unified  model  for  multilevel  description  and  analysis  of  parallel 
algorithms  and  architectures  has  been  developed.  The  model  is  general 
enough  to  describe  any  computational  algorithm  and  to  explicitly  exhibit  its 
parallelism. 

The  basic  descriptive  tool  is  a  precedence  graph,  which  indicates  all 
possible  implementations  of  the  algorithm  in  either  software  or  hardware.  A 
simple structural condition (no cycles in the graph model) guarantees that
the corresponding algorithm is executable. Different implementations of the
same algorithm correspond to different orderings of the vertices (processing
elements) of the precedence graph. Translation of software programs,
data-flow graphs and z-transform descriptions of algorithms/architectures into
precedence graphs and vice versa is easy to carry out.

The precedence graph approach clearly demonstrates the fact that
storage (memory) requirements are determined by the implementation chosen
for an algorithm rather than by the algorithm itself. Thus, the model of a
single computing cell need not include storage at all. The most general
cell is therefore a multiple input, multiple output map, viz.,

$$ Y = f(U, \theta) $$

where U denotes inputs from other cells, $\theta$ denotes parameters, which
may be locally stored in the cell, and Y denotes outputs. A cell is
called:

linear - when f is linear in U (but not necessarily in $\theta$)

time invariant - when parameters are time-invariant

Notice that a cell may be, in general, nonlinear and time-varying. However,
it is always causal.

An actual hardware implementation of a cell also involves a delay of
the output signal Y, consisting of a computation delay (the time required
to compute Y once U is available) and a propagation delay (the time
required for the output signal Y to reach its destination). The analysis
of such delays and their effects upon the throughput of the MCN is presented
in the following section.


SECTION  3 


STRUCTURAL  ANALYSIS  OF  MCNs 


The  notion  of  execution,  defined  in  the  previous  section,  provides 
several  quantitative  characterizations  of  the  MCN  architecture.  In 
particular,  it  can  be  used  to  number  the  processors  of  an  MCN  and  to 
introduce concepts of dimensionality. A refinement of the notion of
execution leads to time schedules and to the formulation of composition
rules for execution times. Thus, the objective of associating a unique
execution time with every output of an MCN is achieved. The third
objective, that of associating a unique measure of complexity with each MCN,
has yet to be accomplished. Currently there is no consensus even upon the
measure of complexity for a single processor, let alone for a network of
processors. Some progress has been made in characterizing complexity in
terms of 'area,' but more research is required before commonly-accepted
rules for composition of complexity can be formulated. For this reason the
topic of complexity will not be considered in the sequel.


3.1  NUMBERING  OF  VARIABLES  AND  PROCESSORS 

The concept of execution, which was defined in Section 2.3, defines a
numbering E(x) on the variables of an MCN, viz.,

$$ x \in S_i \iff E(x) = i \qquad (3.1) $$

Since the partitioning $\{S_i\}$ and the numbering E( ) determine each other
and convey equivalent information, we shall call the function E( ) itself
an execution. Several variables may share the same value of E( ), which
means they belong to the same level $S_i$. If each level of an execution
contains exactly one variable the execution is called sequential. The
function E(x) defines, in this case, a sequential ordering of the
variables and of the processors comprising the MCN. The numbering of
variables determined by an execution E( ) is consistent with the
precedence relation since we clearly have

$$ E(x) \ge 1 + \max_y \{ E(y) ;\ y \in \alpha(x) \} \qquad (3.2) $$

Similarly, we can define a numbering of the processors by

$$ E(p) := \max_x \{ E(x) ;\ x \in X_i(p) \} \qquad (3.3) $$

The value of E(p) indicates the earliest instant at which all inputs of
the processor p become available. We can also define a precedence
relation for processors, viz.,

$$ q \to p $$

if there exists a directed path from q to p. This relation, in turn,
determines the ancestry set $\alpha(p)$ of each processor by

$$ \alpha(p) := \{ q ;\ q \in P,\ q \to p \} \qquad (3.4) $$

It can now be seen that an analog of (3.2) holds for the numbering of
processors, viz.,

$$ E(p) \ge 1 + \max_q \{ E(q) ;\ q \in \alpha(p) \} \qquad (3.5) $$

Since a typical MCN has fewer processors than variables, the numbering of
processors is a more convenient tool for structural analysis of an MCN.


3.2  DIMENSIONALITY  AND  ORDER 


A family of sequential executions $\{E_i(\ )\}$ on a given MCN is called
representative if

$$ q \in \alpha(p) \iff E_i(q) < E_i(p), \quad \text{all } i \qquad (3.6) $$

Notice that a representative family can never consist of a single execution
(except in the case of a purely sequential MCN) because there always exist
two processors q, p such that E(q) < E(p) even though q does not
precede p (nor does p precede q). The following result shows that
every MCN has at least one representative family.

Theorem 3.1

The collection of all sequential executions of a given MCN is a
representative family.

Proof:

By the definition of execution

$$ q \in \alpha(p) \implies E(q) < E(p) $$

for every execution (sequential or not). To prove the converse assume that
$\{E_i(\ )\}$ is the collection of all sequential executions, and that for some
processors p, q

$$ E_i(q) < E_i(p), \quad \text{all } i $$

Clearly p cannot precede q, but they may be incomparable. In this case
there exists a non-sequential execution E( ) such that

$$ E(p) = E(q) $$

Since every execution can be transformed into a sequential one by
arbitrarily ordering the variables in each level, it follows that E can be
converted into a sequential execution, say $E_0$, such that

$$ E_0(q) > E_0(p) $$

This, however, contradicts the assumptions. Hence, p, q cannot be
incomparable and we must have $q \in \alpha(p)$.


A representative family with the smallest number of members will be
called a basis (it need not be unique). The cardinality of bases is defined
as the dimensionality of the MCN under consideration. The members of a basis
$\{E_i(\ )\}$ define a coordinate basis for the network, such that the
coordinates of a processor p are $\{E_1(p), E_2(p), \ldots, E_n(p)\}$. Notice that
the dimensionality of a network is bounded below by the dimensionality of
all its subnetworks, so adding long chains of processors to a 2-dimensional
network cannot reduce the overall dimension below 2 (Figure 3-1).

Every basis of an MCN determines a unique non-sequential execution
obtained by ordering the processors according to the sum of their basis
coordinates. For the example of Figure 3-1 this execution is

$$ (1),\ (2,3),\ (4),\ (5),\ \ldots,\ (n) $$

The order of a basis is defined as the number of processors in the largest
layer of the parallel execution determined by the basis. For the example
above the order is 2 since there is a set of 2 processors in the parallel
execution. Since an MCN may have many bases it has no unique order.
Moreover, each execution E (not necessarily associated with a basis) has
its own order, defined by

$$ \mathrm{ord}(E) := \max_i \left| \{ p ;\ E(p) = i \} \right| \qquad (3.7) $$
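
The order of an execution, as defined in (3.7), is simply the size of its
largest layer. A minimal Python sketch, assuming the execution is given as a
dictionary from processor name to layer number (this encoding and the name
`order_of_execution` are illustrative assumptions):

```python
from collections import Counter


def order_of_execution(E):
    """Return the number of processors in the largest layer of execution E."""
    layer_sizes = Counter(E.values())  # layer number -> number of processors
    return max(layer_sizes.values())


# Layers (1), (2,3), (4), (5): processors 2 and 3 share the second layer.
E = {1: 1, 2: 2, 3: 2, 4: 3, 5: 4}
example_order = order_of_execution(E)
```

For the layering shown, the largest layer holds two processors, so the
order is 2, in agreement with the example above.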

Executions can be implemented in hardware by mapping each layer into a
single iteration, with all the processors in the layer implemented in
parallel. The order of an execution, which is the number of processors in
the largest layer, is therefore a measure of the hardware complexity of such
an implementation.

Once we have coordinate bases at our disposal we can apply metric
arguments to the representation of an MCN. For instance, we can define
distances between processors and introduce the concept of local
communication between processors in a rigorous manner. However, more
research is required to establish the properties of metrics defined by
coordinate bases; in particular, it is not yet clear how the choice of the
coordinate basis affects the metric.

3.3 SCHEDULES, DELAY AND THROUGHPUT

The  execution  of  an  MCN  represents  only  its  precedence  relation  and 
does  not  take  into  account  the  actual  time  required  for  execution.  The 
evaluation  of  each  variable  requires  a  certain  amount  of  execution  time  when 
implemented  in  hardware.  Since  each  output  of  a  processor  may  involve  a 
different  execution  delay,  execution  times  have  to  be  specified  for  arcs  of 
the  precedence  graph  rather  than  for  the  vertices.  The  execution  time 
associated with a variable x will be denoted in the sequel by T(x). This
is the time required to evaluate x from its immediate ancestors
(= parents), i.e., from the variables that serve as inputs to the processor
whose output is the variable x.

The incorporation of time delays into the notion of execution results
in a schedule, which is formally defined as a function $\tau(x)$ that satisfies
the constraint

$$ \tau(x) \ge T(x) + \max_y \{ \tau(y) ;\ y \in \alpha(x) \} \qquad (3.8a) $$

and is zero for the network inputs, viz.,

$$ x \in X_i(P) \implies \tau(x) = 0 \qquad (3.8b) $$

This constraint guarantees, in particular, that the parents of x will be
available at time $\tau(x)$. Thus, schedules are refinements of executions. In
particular, with every execution E( ) we can associate a schedule $\tau(\ )$
by choosing

$$ \tau(x) = \max_y \{ \tau(y) + T(x) ;\ E(y) = E(x) - 1 \} \qquad (3.9) $$

Such schedules are, generally, non-minimal in the sense that some operations
have all their inputs available before their scheduled execution time, i.e.,
(3.8) holds with a strict inequality for such operations. A schedule which
satisfies (3.8) with equality for every $x \in X$ is called minimal.

Minimal schedules are important because they characterize the fastest
executions of a given MCN. This property is made explicit by the following
result.


Theorem 3.2

Every structurally finite MCN has a unique minimal schedule $\tau^*(\ )$. The
minimal schedule satisfies

$$ \tau^*(x) \le \tau(x) \qquad (3.10) $$

for every $x \in X$ and for every schedule $\tau(\ )$.


El£of: 

Since  by  Theorea  2,1  a  structurally  finite  MCN  has  a  countable  nuaber 
of  variables,  the  result  can  be  established  by  induction.  Thus,  let  S  be 
a  subset  of  X  that  is  closed  under  the  ancestry  relation,  naaely  for  every 
x  s  S  we  aust  have  a(x)CI  S.  As suae  that  S  has  already  been  assigned  a 
ainiaal  schedule  r(  )  and  that  this  schedule  also  satisfies  (3.10). 

Choose  a  variable  y  not  in  S  and  consider  the  augaented  network 
detexained  by  SV^/u(y).  We  need  to  show  that  x (  )  can  be  extended  to 
this  augaented  network  and  that  it  will  satisfy  both  (3.8)  and  (3.10)  The 
schedule  t(  )  is  now  extended  to  a(y)  in  the  following  aanner: 

(i)  Assign  t(x)  >0  to  every  x  s  o(y)  that  has  no  ancestors. 

(ii)  Identify  the  collection  of  variables  for  which  all  ancestors  have 
already  been  assigned  a  schedule  (this  set  is  never  eapty) . 

Assign  to  each  one  of  these  variables  the  schedule 

t(x)  T(x)  +  aax  {t(w>;  w  s  o(x)) 
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For  every  *  e  a(z)f  either  x (w)  *  0  or  w  e  S,  so  that 
x(w)  <  x(w)  for  any  schedule  x(  ).  Since  any  schedule  t(  ) 
has  to  satisfy  (3.8)  «e  obtain 

x(z)  2  T(z)  +  an  (x(w);  w  e  a(z)) 

w 

2  T(z)  +  max  (x(w);  w  e  a(z>)  *  x(z) 

w 

which  proves  that  (3.10)  is  preserved  in  this  step. 

(iii)  Augment  S,  viz., 

S  :=  S\Ju(y) 
and  go  bach  to  (ii). 


The repeated application of this procedure results in the assignment of
$\tau^{*}(x)$ to every variable of the MCN. The resulting schedule is minimal,
i.e., it satisfies (3.8) with an equality, unique (by construction) and also
satisfies (3.10).
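
The construction used in the proof can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the original analysis: the names `exec_time` and `preds` are hypothetical, `preds[x]` lists the immediate ancestors of variable x, variables without ancestors are scheduled at 0 as in step (i), and (3.8) is assumed to be the inequality that becomes the assignment of step (ii) when taken with equality.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def minimal_schedule(exec_time, preds):
    """Return the minimal schedule of an acyclic network.

    exec_time[x] -- processing delay T(x) of variable x (hypothetical name)
    preds[x]     -- immediate ancestors of x; empty for network inputs
    """
    tau = {}
    # static_order() yields every variable after all of its ancestors
    # and raises CycleError if the network contains a cycle.
    for x in TopologicalSorter(preds).static_order():
        ancestors = preds.get(x, ())
        if not ancestors:
            tau[x] = 0  # step (i): variables without ancestors
        else:
            # step (ii): the schedule bound taken with equality
            tau[x] = exec_time[x] + max(tau[w] for w in ancestors)
    return tau


preds = {"a": [], "b": [], "c": ["a", "b"], "d": ["a"], "e": ["c", "d"]}
exec_time = {"a": 0, "b": 0, "c": 2, "d": 1, "e": 3}
schedule = minimal_schedule(exec_time, preds)
print(schedule)
```

Processing the variables in topological order plays the role of the induction over ancestry-closed subsets: each variable is scheduled only after all of its ancestors.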


As with executions, we can also define schedules for processors. The
schedule of a processor $p \in P$ is defined as

$$\tau(p) := \max_{x} \{ \tau(x);\ x \in X_I(p) \} \tag{3.11}$$

in analogy with (3.3). It is the instant at which all input variables of the
processor become available. Some of the inputs of the processor may become
available earlier and need, therefore, storage or buffering until they can
actually be used. A variable $x$ is called critical with respect to a given
schedule $\tau(\,)$ if

$$x \in X_I(p) \implies \tau(x) = \tau(p) \tag{3.12}$$

and non-critical otherwise. Thus, the schedule of each processor is
determined by the schedule of its critical inputs. Since non-critical
variables require storage, the general objective of scheduling is to reduce
the total storage requirements.

Storage is measured by the product of volume (e.g., the number of bits
to be stored) and duration. The duration of storage for a variable
$x \in X_I(p)$ is the difference between the time it becomes available and the
latest instant it still needs to be available, i.e.,

$$\max \{ \tau(y) - T(y);\ y \in X_O(p),\ x \to y \} - \tau(x)$$

This interval will be minimized if we choose the difference $\tau(y) - T(y)$ as
short as possible. In view of (3.8), we have to choose $\tau(y) - T(y) = \tau(p)$,
namely the minimal schedule also minimizes the storage requirements of the
network. The minimal schedule still has both critical and non-critical
variables. However, only the critical ones determine the schedule, as
demonstrated by the following result.


Theorem 3.3

Every processor in a structurally finite MCN is connected to a network
input by a finite path whose variables (arcs) are critical under the minimal
schedule.


Proof:

The definition of a critical variable implies that every processor has
at least one critical input variable. The critical path is obtained by
tracing back through the critical inputs of the preceding processors. Since
the ancestry of each processor is finite, this procedure terminates in a
finite number of steps when the path reaches a network input.




Corollary 3.3


The minimal schedule of a processor equals the length (sum of
processing delays) of a critical path that connects a network input to this
processor.

The corollary implies an interesting principle for the physical design
of hardware implementations: critical paths need to be considered first so
that the length of the physical connections along the path can be minimized.
Non-critical paths can accommodate extra propagation delays and can,
therefore, be designed later.
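
The back-tracing argument can be illustrated with a short sketch. It makes a simplifying assumption that is not in the report: each processor is identified with the single variable it produces, so the path is traced over variables. The names `preds`, `exec_time` and `tau` are hypothetical, and `tau` is taken to be the minimal schedule of the small example network.

```python
def critical_path(x, tau, preds):
    """Trace back from x to a network input along critical predecessors."""
    path = [x]
    while preds.get(x):
        # a critical predecessor is one that attains the maximum schedule
        x = max(preds[x], key=lambda w: tau[w])
        path.append(x)
    return path[::-1]


preds = {"a": [], "b": [], "c": ["a", "b"], "d": ["a"], "e": ["c", "d"]}
exec_time = {"a": 0, "b": 0, "c": 2, "d": 1, "e": 3}
tau = {"a": 0, "b": 0, "c": 2, "d": 1, "e": 5}  # minimal schedule of this network

path = critical_path("e", tau, preds)
length = sum(exec_time[v] for v in path)
print(path, length)
```

In this example the sum of processing delays along the traced path equals the schedule of its last variable, which is the statement of the corollary for this network.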

The construction of a schedule is based upon the assumption (3.5b) that
all MCN inputs are available at the very beginning. Thus, a zero schedule
was assumed in (3.8) for every MCN input, i.e.,

$$x \in X_I(P) \implies \tau(x) = 0$$

This is, however, inessential, since many of these inputs will not be
required until much later. The scheduling of the network inputs can be
modified, once a schedule $\tau(\,)$ has been determined, to reflect the
earliest instant they are required in the execution. Thus, for every
$x \in X_I(P)$ redefine the schedule of the inputs to be

$$x \in X_I(P) \implies \tau(x) := \tau(p) \quad \text{where } x \in X_I(p) \tag{3.13}$$

and no buffering, or storage, of the inputs will be necessary. This is
particularly important if not all the inputs can be made available at the
same instant, e.g., in real-time processing of time-series. Notice that
this modification in the scheduling of inputs does not affect the schedule
of any other variable in the network. This is so because only non-critical
input variables are adjusted. The meaning of (3.13) is that all network
inputs are made critical to reduce the storage requirements of the network.
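
A minimal sketch of the rescheduling rule (3.13) follows. The data layout is an assumption made for illustration: `proc_inputs` maps each processor to its input variables, and when a network input feeds several processors the earliest of their schedules is used, a case the text does not discuss.

```python
def reschedule_inputs(tau, proc_inputs, network_inputs):
    """Delay each network input to the schedule of the processor that uses it."""
    # (3.11): a processor is scheduled when its last input becomes available
    tau_p = {p: max(tau[x] for x in xs) for p, xs in proc_inputs.items()}
    new_tau = dict(tau)
    for x in network_inputs:
        users = [p for p, xs in proc_inputs.items() if x in xs]
        # (3.13); with several users, take the earliest (an assumption)
        new_tau[x] = min(tau_p[p] for p in users)
    return new_tau


tau = {"u1": 0, "u2": 0, "v": 4}  # u1, u2 are network inputs, v is internal
proc_inputs = {"p1": ["u1"], "p2": ["u2", "v"]}
new_tau = reschedule_inputs(tau, proc_inputs, ["u1", "u2"])
print(new_tau)
```

Here the input u2 was non-critical for p2 (it waited for v), so it is moved from time 0 to time 4, while the schedule of v is left unchanged.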

The schedule of output variables is commonly known as delay. The delay
of $x$ is the time that has elapsed from the moment some variable in $\alpha(x)$
becomes available until the moment the variable $x$ itself becomes
available. This is, clearly,

$$\tau(x) - \min \{ \tau(y);\ y \in \alpha(x) \}$$

and in many cases it will be equal to $\tau(x)$. In typical signal processing
applications the delay of outputs usually increases without limit as more
and more inputs are applied to the processor and more and more outputs are
evaluated. In such cases one is interested in the rate of output
evaluation, commonly known as throughput, rather than in the delay of the
outputs. The throughput is roughly the number of MCN outputs that are
evaluated in a unit of time. Since this rate may vary, we need a more
rigorous definition based on the concept of schedule.

Every schedule determines a temporal ordering of the MCN variables (it
need not be sequential), which is consistent with the precedence relation
between variables. In order to quantify the rate at which output variables
are evaluated, we define the output counting function

$$N_O(t) := \text{number of elements in the set } \{ y;\ y \in X_O(P),\ \tau(y) \le t \} \tag{3.14}$$

The input counting function can be similarly defined, viz.,

$$N_I(t) := \text{number of elements in the set } \{ y;\ y \in X_I(P),\ \tau(y) \le t \} \tag{3.15}$$

We can now plot the counting function $N(t)$ as a function of $t$ for both
the inputs and the outputs (Figure 3-2). The functions $N_I(t)$, $N_O(t)$ are,
of course, staircase functions (indicated by broken lines in Figure 3-2) and
can be upper-bounded by a pair of continuous, piecewise-linear curves
(indicated by the solid lines in Figure 3-2). The slope of these
curves (which are always strictly monotone increasing) is a measure of the
rate of information flow into the network and out of it, and will be called
the input and output throughput, respectively. A schedule is called regular
when both its input and output throughput are periodic with the same period
(and, in particular, when both throughputs are constant). An MCN is called
temporally-regular when its minimal schedule is regular. Many
temporally-regular networks have equal input and output throughputs, but
this need not be true, in general.
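
The counting function and the throughput derived from it can be sketched as follows. The output schedule used here is hypothetical (one output every 2 time units after an initial delay of 5), and the throughput is estimated as the average slope of the staircase over a window, which is one simple reading of the slope of the bounding curve.

```python
from bisect import bisect_right


def counting_function(times):
    """Return N(t) = number of scheduled instants that are <= t."""
    ordered = sorted(times)
    return lambda t: bisect_right(ordered, t)


# hypothetical output schedule: one output every 2 time units after a delay of 5
output_times = [5 + 2 * k for k in range(10)]
N_out = counting_function(output_times)

# average slope of the staircase over a window, used as the throughput estimate
t0, t1 = 5, 21
throughput = (N_out(t1) - N_out(t0)) / (t1 - t0)
print(N_out(4), N_out(5), N_out(8), throughput)
```

For this regular schedule the estimate is 0.5 outputs per time unit, the reciprocal of the spacing between outputs.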


3.4  SPACE-TIME  DIAGRAMS 

The continuous-time character of the schedule is best demonstrated by
introducing a time-axis into the graphical description of an MCN. The
vertices are arranged so that the vertical displacement from the top of the
diagram to the location of any given vertex $p$ indicates, on an appropriate
scale, the value of the schedule $\tau(p)$ for this vertex (Figure 3-3, compare
with Figure 2-1). This space-time diagram has several interesting
properties:

(1)  All  arcs  point  downward. 

(2)  The  vertical  displacement  of  an  arc  indicates  the  total  execution 
time  associated  with  this  operation,  including  any  buffering  time 
that  may  be  required  beyond  the  actual  execution  time  T(x). 

(3) Changes in local execution times are easily accounted for by
shifting the corresponding vertices up or down along the time
scale. The global effects of such shifts are clearly depicted by
the diagram.

(4) Non-executable MCNs (with zero or negative execution times) can
still be described by the diagram. This is useful to establish
equivalence between various descriptions of the same MCN (e.g.,
precedence graphs and signal flow graphs).

The collection of processors (vertices) with the same schedule forms an
isochrone.

The execution of a network according to a given schedule may now be
interpreted as the propagation of a single wavefront of activity through the
architecture. The location of the activity wavefront at any given instant
is indicated by the corresponding isochrone. Observe that the isochrones
are parallel straight lines (or parallel planes if the precedence graph is
described in a three dimensional space) and do not intersect. Also notice
that the inputs and outputs of a temporally-regular network are evenly
distributed in time (i.e., along the vertical axis of the space-time
diagram). These properties are particularly significant for the analysis of
iterative MCNs, which is carried out in Section 4.

As an illustration of the equivalence between various descriptions of
the same MCN consider the block-diagram of an IIR filter (Figure 3-4a). The
corresponding MCN (Figure 3-4b) can be rearranged in many ways without
modifying the architecture of the network. However, if Figure 3-4b is
interpreted as a space-time diagram (with time being the vertical axis),
such modifications result in different schedules and also in different
block-diagrams. In particular, the delay elements can be moved to the lower
path (Figure 3-5) or split between the two signal paths (Figure 3-6). The
latter version is the only one that can be implemented in hardware because
it contains only downward-pointing arrows; the other two versions
require instantaneous evaluation of each variable associated with a
horizontal arrow. The third description also makes it clear that the time
interval between successive applications of inputs is equal to two delay
units. It is also possible to associate unequal computing times with the
forward and backward propagation through each block. After all, the forward
path only feeds information through the block while the backward path
involves a multiply-and-add operation. The resulting space-time diagram
(Figure 3-7) has delays $T_f$, $T_b$ associated with the forward and backward
paths, and the input interval is clearly $T_f + T_b$. Notice that the block
diagram description involves two different delay blocks: This is known as a
multirate implementation [8]. The throughput rates are, nevertheless, equal
to $(T_f + T_b)^{-1}$ for both the input and the output.

The same technique can be applied to analyze the several proposed
systolic-array-like implementations for matrix multiplication: the
hexagonal array of H.T. Kung [5], the improved hexagonal array of Weiser and
Davis [4], the wavefront array processor of S.Y. Kung [7] and the direct
form realization of S. Rao [10]. Details are provided in Appendix E.

The analysis of the previous examples makes it clear that the common
MCN architecture shared by all the representations of a given processing
system induces certain invariants. For instance, the total number of
outputs of each processor remains invariant, even though in some
representations some of these outputs are connected to a local memory rather
than to a nearby processor (Figure 3-8). The same is true for the total
number of inputs of each processor. Notice that the blocks in Figure 3-8a
are still the same as in Figure 3-4a, including the orientation of paths
(one forward, one backward). On the other hand, the roles of the blocks are
quite different; in particular, outputs are obtained from the local memories
rather than from the left-most block alone, as in Figure 3-4a.

Figure 3-6. Schematic Description #3 of an IIR Filter (a. Block-diagram; b. Space-time diagram)

Figure 3-7. Multirate Implementation of an IIR Filter ($T_f < T_b$) (a. Block-diagram; b. Space-time diagram)


3.5 SUMMARY


Techniques for analysis of pipelinability, schedules, and throughput in
systolic-array-like configurations have been developed based on the MCN
representation of parallel algorithms/architectures. Architecture
evaluation  was  also  based  on  graph  theoretic  properties  of  the  MCN  model: 

The  dimensionality,  degree  of  parallelism,  and  throughput  of  a  given 
architecture  are  all  determined  by  analysis  of  its  precedence  graph. 

The  major  difficulty  in  the  analysis  of  computing  networks  lies  in  the 
translation  of  low-level  input-output  relations  to  high-level  ones,  and  vice 
versa.  We  have  shown  that  the  problem  reduces  to  the  factorization  of  the 
global  (high-level)  input-output  map  into  a  product  of  purely  parallel  maps, 
corresponding  to  the  concept  of  wavefront  propagation  in  the  network.  More 
specifically,  the  global  input-output  map  is  a  sequential  composition  of  the 
maps corresponding to the layers of an execution.


SECTION 4


ITERATIVE  AND  COMPLETELY  REGULAR  NETWORKS 


4.1  ITERATIVE  MCNs  AND  HARDWARE  ARCHITECTURES 


An MCN is called iterative when it can be described as a sequential
composition of identical subnetworks.

Each of the identical components will be called an iteration. One
reason for this name is that the MCN can be executed by implementing a
single component G in hardware and simulating a sequential composition of
such components by spreading the execution of the components in time. The
motivation for studying iterative MCNs is that most signal-processing
algorithms and, in particular, all systolic-array-like architectures can be
described by such networks. Observe that every block-diagram representation
corresponds to an iterative MCN. The iterative structure induces certain
regularity constraints upon the MCN which lead to a simplified
representation.

The minimal schedules of iterative networks are, clearly, periodic with
the same period for input and output schedules. Thus iterative MCNs are
temporally-regular. In addition, they are functionally-regular in the sense
that each iteration involves the same function f. Consequently, their
properties can be determined by analyzing a single iteration. For instance,
the entire network is acyclic (hence executable) if a single iteration is
acyclic. In particular, the executability of z-transform representations of
iterative MCNs is tested by removing all separators and examining the
remaining directed graph for occurrence of cycles (see also [9]).

Similarly,  the  (minimal)  schedule  of  the  network  can  be  determined  by 
considering  a  single  iteration. 


Iterative  MCNs  are  commonly  described  by  recursion  equations  (or 
equivalently  by  z-transform  diagrams),  data-flow  diagrams  (marked  graphs), 
or  by  'do-loops'  in  high-level  programming  languages.  While  precedence 
graphs  of  iterative  networks  still  indicate  all  possible  executions, 
recursion  equations  restrict  the  choice  of  execution  to  one  or  at  most  two 
possibilities (Figure 4-1). And while precedence graphs avoid the pitfall
of  non-executable  iteration  by  explicitly  describing  each  iteration  as  part 
of  an  executable  (acyclic)  precedence  graph,  data-flow  diagrams  contain 
cycles  which  may  cause  the  entire  MCN  to  be  non-executable. 

Since  all  iterations  are  identical,  the  schedules  of  every  two 
consecutive  iterations  differ  by  the  same  constant,  which  we  shall  call  the 
input  interval.  The  input  interval  is  the  period  of  the  input  schedule  or, 
equivalently,  of  the  input  throughput,  as  well  as  of  the  output  schedule 
(recall  that  iterative  MCNs  are  temporally  regular).  It  determines  an  upper 
bound  on  the  rate  at  which  inputs  are  applied  to  the  network  (lower  rates 
are  permitted,  but  require  additional  buffering). 

The  time-space  diagram  of  an  iterative  MCN  corresponds  to  its  minimal 
schedule  and  is,  therefore,  periodic.  It  is  important  to  notice  that  the 
period  (=  input  interval)  is,  in  general,  shorter  than  the  time  required  to 
complete the execution of a single iteration (= the iteration delay). This
means  that  hardware  realizations  of  the  MCN  can  be  pipelined:  New  inputs 
can  be  applied  before  the  processing  of  previous  inputs  has  been  completed. 

The  functional  regularity  of  iterative  MCNs  implies  that  they  can  be 
implemented  in  special  purpose  VLSI  hardware  by  mapping  the  precedence  graph 
of a single iteration directly into silicon. Each processor is mapped into
a cell ('processing element') and precedence relations are mapped into
interconnections between cells. Neither translation nor hardware
compilation is required to accomplish this mapping since the hardware
architecture  is  an  exact  image  of  a  single  layer  of  the  network 
architecture.  An  execution  is  now  interpreted  as  the  propagation  of  a 
sequence  of  wavefronts  through  the  hardware  rather  than  the  propagation  of  a 
single  activity  wavefront  through  the  iterative  MCN  (Figure  4-2).  The  time 
spacing  of  these  wavefronts  equals  the  period  of  the  underlying  MCN  minimal 
schedule. 


Since a single layer of the MCN is used to 'simulate' the entire
network, each processor is activated many times and each arc of the hardware
architecture corresponds to a time-series of variables rather than to a
single variable. This raises a design problem of a new kind: It is
necessary to guarantee that variables do not disappear before they have been
used to evaluate their successors. There are three different solutions to
this problem:

(1) Iterative execution: A new iteration is initiated only after the
execution of the preceding iteration has been completed. This
means that the input interval is extended (by buffering of
intermediate results) to the length of the iteration delay, and
the time-overlap between iterations is completely eliminated.

(2) Scheduled execution: The (minimal) schedule of the network is
determined in advance and execution is carried out according to
schedule. Buffering is provided to guarantee the availability of
inputs to processors on schedule (only non-critical variables need
to be buffered).

(3) Self-timed execution: Processors are activated as soon as their
inputs become available. Acknowledgment signals ('hand-shaking')
are used to guarantee the correct transfer of data between
processors.

While scheduled execution offers the shortest execution time and requires a
fairly simple control system, it is extremely sensitive to scheduling
perturbations. Such perturbations, which are caused by clock-skewing and
local variations in execution times, may result in loss of synchronization
between cells and a complete failure of the system. Iterative execution is
insensitive to scheduling perturbations and requires a very simple control
system, but wastes processing time since the hardware is idle most of the
time. Self-timed execution provides a nice tradeoff between these two
extremes: Its execution time is only slightly longer than the theoretical
minimum achieved by scheduled execution; and the control system it requires
has about the same complexity as the timing system for scheduled execution.
It is interesting to observe that the conditions for self-timed execution
[16], [17] coincide with the concept of admissible composition, which was
shown to be the necessary and sufficient condition for executability in
general. Thus, every MCN can be implemented as a self-timed system.

The notion of self-timed execution suggests the introduction of
self-timed block-diagrams. These are obtained by removing the delay-elements
from a conventional block-diagram and replacing them by direct connections.
The hardware implementation of such self-timed diagrams is straightforward
provided two simple rules are obeyed:

(i) Each cell is activated as soon as all its inputs become available
and deactivated as soon as all its outputs have been evaluated.

(ii) Each input variable is accompanied by an acknowledgment line.
Each input port (sink) acknowledges the arrival of a new input
variable to the processor that evaluated this variable. The
acknowledgment is sent when the processor connected to the input
port becomes activated.

These  rules  assume  that  each  cell  is  provided  with  sufficient  memory  to 
store  its  output  variables  until  they  become  acknowledged. 

The acknowledgment of inputs associated with self-timed implementations
can (and should) be reflected in the space-time diagram of the network.
Acknowledgment signals are just one more set of variables in the network,
and are represented in the space-time diagram by arcs, as any other
variable. For instance, a cascade connection of (identical) processors
(Figure 4-3a) has an input interval of $\tau_p + \tau_a$, where $\tau_p$ is the
execution time of the processor and $\tau_a$ is the delay between the reception
of an acknowledgment signal from the subsequent processor and the
transmission of an acknowledgment signal to the preceding processor
(Figure 4-3b). The interval $\tau_a$ includes the propagation time through the
processor and the connecting wires plus the time required to carry out checks
on the input data (parity, error detection, fault detection, etc.). Notice
that the need for explicit acknowledgment can be eliminated in many cases,
e.g., when there is an information carrying path along the cascade in the
backward direction.
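
As a small numerical illustration of the cascade example, the sketch below computes the input interval as the sum of the processor execution time and the acknowledgment delay, together with the resulting throughput. The function and parameter names (`cascade_timing`, `exec_delay`, `ack_delay`) and the numerical values are hypothetical, chosen only for the example.

```python
def cascade_timing(exec_delay, ack_delay, n_inputs):
    """Times at which a self-timed cascade accepts its successive inputs."""
    interval = exec_delay + ack_delay  # input interval of the cascade
    accept_times = [k * interval for k in range(n_inputs)]
    return interval, 1.0 / interval, accept_times


interval, throughput, times = cascade_timing(exec_delay=3.0, ack_delay=1.0, n_inputs=4)
print(interval, throughput, times)
```

With an execution time of 3 and an acknowledgment delay of 1, a new input is accepted every 4 time units, i.e., the throughput is 0.25 inputs per time unit.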

[Figure panel: Self-Timed Block-Diagram]

The horizontal dimension of space-time diagrams can now be interpreted
as hardware: Processors located along the same horizontal line
(isochrone) represent computations that need to be carried out
simultaneously and must, therefore, be implemented in parallel hardware. We
shall adopt the convention of interpreting the vertical dimension of
space-time diagrams as pure time: Processors located along the same vertical
line will represent computations that are carried out by the same processing
element but during different (non-overlapping) intervals of time. Thus, for
instance, the MCN of Figure 4-4 can be implemented in hardware with four
processing elements (Figure 4-4b). Each vertical column of processors in
the space-time diagram of the MCN (Figure 4-4a) is mapped into a single
hardware cell; connections between columns are mapped into physical
connections between cells and connections within columns are implemented by
locally storing intermediate results inside the appropriate cells.
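
The column-to-cell mapping described above can be sketched as follows. The representation of a space-time diagram as a list of arcs between (column, time) pairs is an assumption made for illustration, and the four-column example is hypothetical rather than a transcription of Figure 4-4.

```python
def project_to_hardware(arcs):
    """Map a space-time diagram onto hardware cells.

    arcs -- iterable of ((col_u, t_u), (col_v, t_v)) pairs, one per variable
    """
    cells, wires, local_storage = set(), set(), set()
    for (col_u, _), (col_v, _) in arcs:
        cells.update((col_u, col_v))
        if col_u == col_v:
            local_storage.add(col_u)   # result kept inside the cell
        else:
            wires.add((col_u, col_v))  # physical connection between cells
    return cells, wires, local_storage


# hypothetical diagram: 4 columns, two layers of arcs of a left-to-right cascade
arcs = [((c, t), (c + 1, t + 1)) for c in range(3) for t in range(2)]
arcs += [((c, t), (c, t + 1)) for c in range(4) for t in range(2)]
cells, wires, local_storage = project_to_hardware(arcs)
print(sorted(cells), sorted(wires), sorted(local_storage))
```

Dropping the time coordinate collapses all vertices of a column into one cell, so the many arcs of the diagram reduce to three wires and four local memories in this example.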

Self-timed  or  scheduled  execution  is,  indeed,  faster  than  iterative 
execution  only  if  the  input  interval  is  shorter  than  the  execution  time  of  a 
single  iteration.  An  implementation  of  such  an  execution  will  initiate  a 
new  iteration  before  the  execution  of  the  previous  iteration  has  been 
completed.  Such  implementations  deserve  to  be  called  pipelined.  Thus, 
iterative  executions  are  never  pipelined,  while  self-timed  (or  scheduled) 
executions  are  pipelined  only  for  pipelinable  MCNs. 
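
The gain from pipelining can be quantified with a short sketch. It assumes, for illustration only, that a pipelined realization starts a new iteration every input interval and that the last iteration still needs one full iteration delay to finish; the function names are hypothetical.

```python
def total_time(n_iterations, input_interval, iteration_delay, pipelined=True):
    """Completion time of n iterations under the two extreme policies."""
    if pipelined:
        # a new iteration starts every input interval; the last one still
        # needs a full iteration delay to finish
        return (n_iterations - 1) * input_interval + iteration_delay
    # iterative execution: no overlap between consecutive iterations
    return n_iterations * iteration_delay


def is_pipelined(input_interval, iteration_delay):
    """True when a new iteration starts before the previous one completes."""
    return input_interval < iteration_delay


print(is_pipelined(2, 6), total_time(10, 2, 6), total_time(10, 2, 6, pipelined=False))
```

For ten iterations with an input interval of 2 and an iteration delay of 6, the overlapped execution finishes at time 24 while the iterative execution needs 60.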

Notice that the input interval is uniquely defined for every
temporally-regular MCN, but the iteration delay (= execution time of a
single iteration) depends upon the partitioning of the MCN into iterations.
Since this partitioning need not be unique, an iterative MCN may have
several hardware realizations, each with a different iteration delay. Thus,
pipelining is primarily a property of a given hardware realization. An MCN
is considered pipelinable if it has at least one pipelined realization.

Pipelinability is most frequently associated with completely regular MCNs
(= systolic-array-like networks). The connection between complete
regularity and pipelinability is discussed in the following section.


4.2 COMPLETELY REGULAR MCNs


A completely regular MCN is one that can be represented by a regular
multidimensional grid, and in which all input-output maps associated with
the vertices are identical. Thus, the vertices of a completely regular MCN
can be mapped into points of the multidimensional grid $Z^n$ in the
n-dimensional Euclidean space $R^n$, where $Z$ denotes the set of integers;
the arcs of a completely regular MCN become vectors (n-tuples of real
numbers) representing the directed straight lines connecting points of the
grid $Z^n$. Clearly, not all points in $Z^n$ correspond to vertices of the
MCN. Those that do determine the domain $\Gamma$ of the MCN in $Z^n$. The
requirement of complete regularity translates into the statement that the
vectors (arcs) emanating from any point (vertex) in $\Gamma$ do not depend upon
the choice of vertex. Consequently, the entire MCN is characterized by:

(i) the set of dependence vectors $\{d_i\}$ emanating from a single
vertex;

(ii) the domain $\Gamma \subset Z^n$;

(iii) the input-output map f

$$f: (x_1, \ldots, x_p) \longrightarrow (y_1, \ldots, y_p)$$

associated with every vertex in the domain $\Gamma$.

A curious consequence of this definition is that the input-output map f
has the same number of inputs and outputs, since the number of arcs
emanating from a point in $\Gamma$ is always the same as the number of arcs
converging to a point.

Not every set of dependence vectors $\{d_i\}$ determines a valid MCN. For
instance, the directed graph representing an MCN has to be acyclic. In
terms of dependence vectors this means that it is impossible to find
non-negative integers $\{k_i\}$, not all zero, such that $\sum_i k_i d_i = 0$.
Another requirement is that the ancestry of every vertex $v \in \Gamma$ (i.e.,
the set of all points from which $v$ can be reached) has to be finite. This
constraint is trivial if $\Gamma$ is a finite set; however, if $\Gamma$ is infinite
(as is often the case with signal processing algorithms) this constraint
implies that $\Gamma$ has to be bounded in the directions $\{-d_i\}$.

In the sequel we shall focus upon completely regular MCNs in $Z^3$,
because such MCNs correspond to space-time representation of planar
systolic-array-like architectures (see [23] - [31]). We shall impose the
constraint of causality resulting from the association of 'time' with one of
the coordinate axes in $Z^3$ and examine the flow of data through the
architecture in terms of the dependence vectors characterizing the MCN.


4.2.1 Space-Time Representations


MCNs in $Z^3$ are characterized by 3-dimensional dependence vectors
$\{d_i\}$, which we shall represent by row vectors of length 3. The collection
of all dependence vectors

$$D := [d_i]_{i=1}^{p} \tag{4.1}$$

forms a $p \times n$ matrix (with $n = 3$ here), which we shall call the
dependence matrix. The boundary of the domain $\Gamma$ can always be described
as a polyhedron. It will be sufficient for our purposes to consider only
convex polyhedra, and in fact, only those that can be characterized in terms
of the dependence vectors (see Section 5.4 for a further discussion of this
choice).

The interpretation of MCNs in $Z^3$ as space-time representations of
hardware architectures imposes the additional constraint of causality:
every dependence vector must have a positive time coordinate, since
computation and propagation of data cannot be accomplished in zero time.
Moreover, since data cannot propagate faster than the speed of
electromagnetic waves in metallic conductors, the directions of dependence
vectors must lie within a certain cone, the time-like cone in the space-time
continuum. By appropriate scaling of space and time coordinates we can
reduce this condition to the requirement

$$d_i \, [0 \ 0 \ 1]^{T} \ge 1 \tag{4.2}$$

which means that the third coordinate of $d_i$ must be (an integer) larger or
equal to 1.
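
Condition (4.2) is easy to check mechanically, and it also guarantees acyclicity: any combination of causal dependence vectors with non-negative integer coefficients, not all zero, has a strictly positive time coordinate and therefore cannot be the zero vector. The sketch below illustrates both points; the dependence vectors and function names are hypothetical.

```python
def is_causal(D):
    """Check (4.2): every dependence vector has an integer time coordinate >= 1."""
    return all(isinstance(d[2], int) and d[2] >= 1 for d in D)


def time_advance(k, D):
    """Time coordinate of the combination sum_i k[i] * D[i]."""
    return sum(ki * d[2] for ki, d in zip(k, D))


D = [(1, 0, 1), (0, 1, 1), (-1, -1, 1)]  # hypothetical dependence vectors
print(is_causal(D), time_advance((1, 1, 1), D))
```

In the example the three vectors sum to zero in space, so a path following them returns to the same cell, but it does so three time units later, which is why no cycle arises in the space-time graph.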

The association of time with the third coordinate of dependence vectors
allows us to express the finite ancestry condition in simple form. The
exclusion of ancestors that are infinitely remote from a given vertex in the
domain $\Gamma$ is equivalent to the requirement that $\Gamma$ be a subset of the
positive half space of $Z^3$, i.e., the half space corresponding to
non-negative time coordinates. Moreover, since hardware must always be
finite, the spatial extent of $\Gamma$ must be bounded. Thus, the only direction
in which $\Gamma$ may remain unbounded is that of positive time, corresponding to
a computation that continues indefinitely in time (e.g., a filtering of an
infinite time-series), but produces results (outputs) at regular intervals.

Vertices in $\Gamma$ that share the same spatial coordinates are considered
as representing the same hardware processor at different instants in time.
Regularity implies that such isospatial vertices are spread in time at
regular intervals. This interval, which is the same for all processors,
will be called the periodicity index of the architecture. The periodicity
index corresponding to a given dependence matrix $D$ is the smallest
positive solution $n$ of the equation

$$qD = n \, [0 \ 0 \ 1] \tag{4.3}$$

where  q  is  any  row  vector  with  integer  (possibly  negative)  entries.  To 
prove  this  result  we  notice  that  qD  is  an  integer  coabination  of 
dependence  vectors;  aoreover,  if  v(x^.y^.t^)  and  Y^x2,^2’^2^  are  two 
distinct  vertices  in  T,  then  the  vector  connecting  these  vertices  can 
always  be  expressed  in  the  fora  qD  for  an  appropriate  (possibly  nonunique) 
row  vector  q.  If  the  two  vertices  share  the  saae  spatial  coordinates,  then 
their  interconnecting  vector  is  colinear  with  [0  0  1],  and  so  (4.3)  is 

satisfied  for  soae  q,n.  Finally,  the  saallest  teaporal  displacement  is 
obtained  when  n  is  ainiaized  in  (4.3).  The  periodicity  index  n  can, 
actually,  be  evaluated  without  an  exhaustive  search  through  all  possible 
integer  vectors  of  q  that  satisfy  (4.3),  ss  is  deaonstrated  in  Section 
4.2.2. 
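
As a small worked instance of (4.3), consider a dependence matrix with two unit-time vectors that point in opposite spatial directions. The matrix below is an illustrative assumption, not one taken from a figure of this report.

```latex
D = \begin{bmatrix} 1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix},
\qquad
q = [a \;\; b]
\;\Longrightarrow\;
qD = [\,a-b \;\;\; 0 \;\;\; a+b\,].
% The spatial part vanishes only when a = b, so qD = 2a [0 0 1].
% The smallest positive temporal displacement is reached at a = b = 1:
q = [1 \;\; 1], \qquad qD = 2\,[0 \;\; 0 \;\; 1], \qquad n = 2.
```

Under this assumption every processor is revisited every second time step, which is the sense in which n measures the spacing of isospatial vertices.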

The most important attribute of the space-time representation of a
completely regular MCN is the invariance of the MCN under coordinate
transformations in space-time. This is so because coordinate
transformations do not affect the interconnection pattern of the space-time
representation, and consequently leave the corresponding directed graph
unaltered. In the case of regular space-time representations it is
sufficient to consider the effect of linear coordinate transformations; this
is done in detail in Sections 5 and 6.


4.2.2  Spatial  Projection  of  MCNs  in 


The first two coordinates in a three-dimensional space-time can be
interpreted as physical space. When a space-time representation is
projected into the plane formed by the first two coordinates, vertices
represent computing agents (i.e., processors) and arcs represent physical
interconnections (i.e., wires). The projection amounts to the truncation of
each dependence vector to its first two coordinates, viz.,

$$ D_s = D \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix} \qquad (4.4) $$

The truncated dependence matrix $D_s$ ('s' stands for 'spatial') is usually
sufficient to characterize the architecture, since we commonly assume that
each dependence vector represents a computation that requires a unit of
time, and consequently

$$ D = \begin{bmatrix} D_s & \mathbf{1} \end{bmatrix}, \qquad \mathbf{1} := [1 \;\; 1 \;\; \cdots \;\; 1]^T \qquad (4.5) $$

This assumption is violated only when D has a periodicity index n(D) > 1
and, in addition, D contains a dependence vector of the form $[0 \;\; 0 \;\; \tau]$.
This dependence vector is truncated to [0 0], so $\tau$ cannot be recovered
unless $\tau = n$ or $\tau = 1$. These, in fact, are the only two possible
values for $\tau$, as explained in Section 6.4.

The truncated dependence matrix can be pictorially represented by a
conventional block-diagram such as Figure 4-5. Each truncated dependence
vector is represented by an arc with the appropriate spatial displacement,
while truncated dependence vectors of the form [0 0], which correspond
to local memory, are represented by self-arcs.


Figure 4-5. Example of a Regular Hardware Architecture


The truncated equivalent of (4.3) becomes

$$ q D_s = 0 \qquad (4.6) $$

so that every feasible choice of q corresponds to an undirected loop in
the 2-dimensional block-diagram representation. Thus, every feasible q is
obtained by considering all possible loops in the block-diagram
representation. If there are no self-loops on vertices, then $D_s$ contains
no zero row and (4.5) holds. Consequently, by (4.3),

$$ q D \, [0 \;\; 0 \;\; 1]^T = q \, [1 \;\; 1 \;\; \cdots \;\; 1]^T = n $$

so n is obtained by adding up the entries of q. This is, in fact, done
by counting each arc along the loop as 1 if it coincides with the
orientation of the loop and as -1 if it points in the reverse direction.
Since the smallest value of n is required, only the shortest loops need to
be considered. We shall show in Section 5.3 that n never exceeds 3 and is
seldom larger than 1.
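
The loop-counting rule can be sketched in a few lines of Python. This is a minimal brute-force illustration, not the report's procedure: the function name, the search bound on the entries of q, and the example matrices are assumptions, and unit-time arcs are assumed throughout, as in (4.5).

```python
from itertools import product

def periodicity_from_loops(Ds, bound=2):
    """Smallest positive loop sum for a truncated dependence matrix Ds.

    Each integer row vector q with q * Ds == 0 describes an undirected loop;
    with unit-time arcs the loop's temporal displacement is sum(q).
    Entries of q are searched in [-bound, bound] (an illustrative limit).
    """
    sums = []
    for q in product(range(-bound, bound + 1), repeat=len(Ds)):
        closed = all(sum(qi * row[k] for qi, row in zip(q, Ds)) == 0
                     for k in range(2))
        if closed and sum(q) > 0:
            sums.append(sum(q))
    # Convention stated in Section 5.3: no loop means a unit periodicity index.
    return min(sums) if sums else 1

two_way = [[1, 0], [-1, 0]]            # two opposite links on a line
hexagonal = [[1, 0], [0, 1], [1, 1]]   # third link closes a loop with sum 1
print(periodicity_from_loops(two_way), periodicity_from_loops(hexagonal))
```

The exhaustive search is only meant to make the rule concrete; as the text notes, inspecting the shortest loops of the block diagram is enough in practice.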


4.3  MODULAR DECOMPOSITION OF MCN MODELS

The  conversion  of  a  given  algorithm  into  an  MCN  model  is  based  upon  the 
assumption  that  the  fundamental  building  blocks — the  processors — are 
implementable  in  hardware.  This  is  indeed  so  if  each  processor  represents  a 
few  scalar  operations,  which  can  be  handled  by  any  contemporary  computing 
agent.  However,  signal  processing  algorithms  are  rarely  specified  in  this 
convenient  form;  most  often  they  are  represented  by  block-diagrams  whose 
blocks  involve  multivariable  manipulations,  and  in  particular,  matrix 
algebra. The construction of a computer program or a hardware
implementation for such MCNs requires decomposing each multichannel
processor into a subnetwork of scalar processors. One way to achieve this
decomposition is by storing standard subnetworks for commonly used
operations such as matrix addition, multiplication and inversion in a
library and invoking this information whenever the need arises. However,
experience with signal processing schemes indicates that better results are
obtained by matching the method of decomposition to the structure of the
problem. By a judicious application of the principle of modular
decomposition [32] we obtain completely regular MCNs which are perfectly
suited for implementation in VLSI, and have a much higher throughput than
those obtained by mapping matrix operations directly into parallel hardware.

There is, at present, no simple test to establish the applicability of
the principle of modular decomposition to a given multichannel operation.
When the operation is linear in all its operands the necessary and
sufficient conditions for modular decomposability can be stated in simple
terms as described below.


4.3.1  Modular  Decomposition  of  Linear  Multivariable  Filters 


Let H(z) be the transfer function of a multivariable filter with an
equal number of inputs and outputs. Suppose that the inputs have been
aggregated into three multichannel inputs $x$, $u_1$, $u_2$ and the outputs
have been respectively aggregated into $w$, $y_1$, $y_2$ (Figure 4-6). Thus,

$$ \begin{bmatrix} y_1(z) \\ y_2(z) \\ w(z) \end{bmatrix} = H(z) \begin{bmatrix} x(z) \\ u_1(z) \\ u_2(z) \end{bmatrix} \qquad (4.7) $$

Suppose that there exist transfer functions $H_1(z)$, $H_2(z)$ such that

$$ H(z) = \begin{bmatrix} I_p & 0 \\ 0 & H_2(z) \end{bmatrix} \begin{bmatrix} H_1(z) & 0 \\ 0 & I_q \end{bmatrix} \qquad (4.8) $$

where p is the number of channels in $u_1$ and q is the number of channels
in $u_2$. Then the filter can clearly be decomposed in the form described
by Figure 4-6. Conversely, if such a decomposition exists, then (4.8) will
hold. Thus, the existence of the factorization (4.8) is a necessary and
sufficient condition for the decomposability of a multivariable filter as
described by Figure 4-6.


To make the decomposability condition more explicit we shall describe
the transfer functions $H_i(z)$ in terms of blocks, viz.,

$$ H_i(z) = \begin{bmatrix} A_i(z) & B_i(z) \\ C_i(z) & D_i(z) \end{bmatrix}, \qquad i = 1, 2 \qquad (4.9) $$

The corresponding decomposition of H(z) yields, via (4.8), (4.9), the
identity

$$ H(z) := \begin{bmatrix} H_{11}(z) & H_{12}(z) & H_{13}(z) \\ H_{21}(z) & H_{22}(z) & H_{23}(z) \\ H_{31}(z) & H_{32}(z) & H_{33}(z) \end{bmatrix} = \begin{bmatrix} A_1(z) & B_1(z) & 0 \\ A_2(z)C_1(z) & A_2(z)D_1(z) & B_2(z) \\ C_2(z)C_1(z) & C_2(z)D_1(z) & D_2(z) \end{bmatrix} \qquad (4.10) $$

This means that some elements of $H_i(z)$ can be uniquely determined from
those of the given H(z), for instance

$$ A_1(z) = H_{11}(z), \qquad D_2(z) = H_{33}(z), \qquad \text{etc.} $$

Since the blocks $B_i(z)$, $C_i(z)$ are square (while $A_i(z)$, $D_i(z)$
need not be square), we also obtain, assuming nonsingularity of square
transfer functions,

$$ A_2(z) = H_{21}(z) \, C_1^{-1}(z) \qquad (4.11a) $$

which implies

$$ H_{21}(z) \, C_1^{-1}(z) \, D_1(z) = H_{22}(z) \qquad (4.11b) $$

Similarly

$$ C_2(z) = H_{31}(z) \, C_1^{-1}(z) \qquad (4.12a) $$

which implies

$$ H_{31}(z) \, C_1^{-1}(z) \, D_1(z) = H_{32}(z) \qquad (4.12b) $$


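As a consequence, the elements of H(z) cannot be arbitrary. In fact,

$$ \begin{bmatrix} H_{21}(z) & H_{22}(z) \\ H_{31}(z) & H_{32}(z) \end{bmatrix} \begin{bmatrix} C_1^{-1}(z) D_1(z) \\ -I \end{bmatrix} = 0 \qquad (4.13a) $$

which means that for all z

$$ \operatorname{rank} \begin{bmatrix} H_{21}(z) & H_{22}(z) \\ H_{31}(z) & H_{32}(z) \end{bmatrix} \leq n $$

where n is the number of channels in the signals x, v, w. In fact, if we
insist that square transfer functions are, generically, invertible, then
$H_{31}(z)$ is an n x n nonsingular matrix, so that

$$ \operatorname{rank} \begin{bmatrix} H_{21}(z) & H_{22}(z) \\ H_{31}(z) & H_{32}(z) \end{bmatrix} = n \qquad (4.13b) $$

Conversely, if (4.13b) holds, then

$$ C_1^{-1}(z) \, D_1(z) = H_{31}^{-1}(z) \, H_{32}(z) \qquad (4.14) $$

Choosing $C_1(z)$ arbitrarily we obtain $A_2(z)$, $C_2(z)$, $D_1(z)$ via
(4.11a), (4.12a) and (4.14), respectively.

In summary, a necessary and sufficient condition for a decomposition of
the form (4.10) to exist is that the rank condition (4.13b) holds and in
addition

$$ H_{13}(z) = 0 \qquad (4.15) $$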

Given a transfer function H(z) that satisfies the decomposability
conditions we can always compute $H_1(z)$, $H_2(z)$. In fact, we may choose
$C_1(z)$ arbitrarily, and the simplest choice is $C_1(z) = I$. This yields
the decomposition

$$ H(z) = \begin{bmatrix} I_p & 0 & 0 \\ 0 & H_{21}(z) & H_{23}(z) \\ 0 & H_{31}(z) & H_{33}(z) \end{bmatrix} \begin{bmatrix} H_{11}(z) & H_{12}(z) & 0 \\ I & H_{31}^{-1}(z) H_{32}(z) & 0 \\ 0 & 0 & I_q \end{bmatrix} \qquad (4.16) $$
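
The decomposability test and the choice $C_1 = I$ can be checked numerically on a toy case. The sketch below assumes single-channel signals (p = q = n = 1) and evaluates every block at one fixed value of z, so each block is just a number. The block values and the helper names `matmul` and `decompose` are made up for illustration, and the block layout follows the product of the two factors in (4.8).

```python
from fractions import Fraction

def matmul(X, Y):
    """Plain matrix product for small lists of lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def decompose(H):
    """Split a 3x3 H with scalar blocks into two cascaded factors (C1 = 1)."""
    if H[0][2] != 0:
        raise ValueError("H13 must vanish")
    if H[2][0] == 0:
        raise ValueError("H31 must be invertible")
    if H[1][0] * H[2][1] != H[1][1] * H[2][0]:
        raise ValueError("rank condition fails")
    d1 = Fraction(H[2][1]) / H[2][0]          # D1 = H31^{-1} H32 when C1 = 1
    left = [[1, 0, 0], [0, H[1][0], H[1][2]], [0, H[2][0], H[2][2]]]
    right = [[H[0][0], H[0][1], 0], [1, d1, 0], [0, 0, 1]]
    return left, right

# Build H from made-up block values, then factor it again.
A1, B1, C1, D1 = 2, 3, 5, 7
A2, B2, C2, D2 = 11, 13, 17, 19
H = [[A1, B1, 0], [A2 * C1, A2 * D1, B2], [C2 * C1, C2 * D1, D2]]
left, right = decompose(H)
print(matmul(left, right) == H)
```

The recovered factors differ from the blocks used to build H (here $C_1$ is 1 rather than 5), which illustrates that the decomposition is not unique.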


SECTION 5


CLASSIFICATION OF ARCHITECTURES


Completely regular MCNs were characterized in the previous section in
terms of their dependence vectors. It was also indicated that MCNs with
different dependence vectors may nevertheless be equivalent, namely they
will have equivalent space-time representations. The equivalence of
completely regular MCNs is easy to verify, since it amounts to the existence
of a nonsingular linear transformation relating the dependence matrices of
the MCNs in consideration.

The  study  of  equivalence  can  be  carried  out  at  several  different  levels 
of  abstraction.  At  the  lowest  (most  detailed)  level  each  completely  regular 
MCN  is  represented  by  a  dependence  matrix 

$$ D := \begin{bmatrix} d_1 \\ d_2 \\ \vdots \end{bmatrix} \qquad (5.1) $$

where $d_i$ are row vectors of length 3 whose first two coordinates represent
the planar space of integrated circuits and the third coordinate represents
time. Thus, for instance, the MCN of Figure 5-1 is characterized by the
dependence matrix

$$ D = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix} $$


Notice that the time coordinate of all three dependence vectors equals 1,
reflecting the assumption that each dependence vector represents a
computation that requires a unit of time. This assumption can, of course,
be modified to incorporate computations with unequal processing times.
Notice also that the direction of dependence vectors coincides with that of
the arrows in Figure 5-1, pointing toward the successors of a given
processor, rather than toward the predecessors of the same processor, as in
[25].


Figure 5-1. Example of a Completely Regular MCN


At the intermediate level of abstraction only the spatial coordinates
of each dependence vector are considered. This results in the elimination
of the third column of the dependence matrix D, yielding the truncated
dependence matrix

$$ D_s = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} $$

for the example of Figure 5-1. We shall show in the following section that
the truncated dependence matrix $D_s$ provides, in fact, a complete, albeit
implicit, characterization of the MCN. This characterization can be
transformed in a unique manner into the explicit characterization D.

At the highest level of abstraction only the topology of the hardware
is  considered.  This  means  that  the  directed  graph  representing  the  flow  of 
data  is  replaced  by  the  corresponding  non-directed  graph.  Thus,  for 
instance,  the  MCN  of  Figure  5-1  and  that  of  Figure  5-2  are  topologically 
equivalent,  even  though  the  latter  has  a  different  dependence  matrix,  viz. 


Figure  5-2.  A  Completely  Regular  MCN  Which  is  Topologically  Equivalent 

to  that  of  Figure  5-1 

This section is devoted to the study of topological equivalence
followed by the study of architectural ($D_s$) equivalence. The more
complicated  topic  of  space-time  equivalence  is  presented  in  the  following 
section,  where  it  is  also  shown  that  distinct  hardware  configurations  may, 
nevertheless,  have  equivalent  space-time  representations. 

5.1  TOPOLOGICAL  EQUIVALENCE 

The topic of topological equivalence has been studied by Miranker and
Winkler [25], who have shown that there are only three distinct topologies
(Figure 5-3):

(1) The linear topology, with a single dependence vector,

$$ D_s = [1 \;\; 0] $$

(2) The rectangular topology, with two dependence vectors,

$$ D_s = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $$

(3) The hexagonal topology, with three dependence vectors,

$$ D_s = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} $$

Every systolic-array-like architecture can be related by a linear
transformation to one of these fundamental topologies. Also, it is
impossible to have more than three non-colinear dependence vectors in a
planar architecture.

The same conclusion can be reached by a graph-theoretic argument. The
graph describing the hardware configuration of a completely regular MCN is
clearly a mosaic, i.e., a planar graph in which all faces are bounded by the
same number of edges and all vertices (except those on the external boundary
of the graph) have the same number of incident edges. As is well known,
there are only three possible mosaics [21]: triangular, rectangular and
hexagonal. The triangular mosaic has vertices of degree 6 and coincides
with the hexagonal topology. The rectangular mosaic has vertices of degree
4 and coincides with the rectangular topology. The hexagonal mosaic (Figure
5-4) does not correspond to a completely regular MCN, since it requires two
sets of dependence vectors rather than one. However, it can be rearranged
by combining pairs of adjacent processors into a single processor (Figure
5-4b), so that the resulting configuration has a rectangular topology. Thus,
there are only two mosaics corresponding to completely regular MCNs, which
combined with the linear configuration makes a total of 3 fundamental
topologies.


5.2  ARCHITECTURAL  EQUIVALENCE 

Each of the interconnecting wires in the three fundamental topologies
can be associated with two direction vectors, one pointing along the wire in
one way, the other in the reverse. This makes a total of three
possibilities for each interconnecting wire: (i) $+d$, (ii) $-d$, and
(iii) $\pm d$. This means that the linear topology results in $3^1 = 3$
architectures, the rectangular topology in $3^2 = 9$ architectures and the
hexagonal topology in $3^3 = 27$ architectures. Since many of these
architectures are equivalent, a classification of the distinct architectures
is provided in Table 5-1. The nomenclature consists of a capital letter
(L, R or H) indicating the topology (linear, rectangular or hexagonal), a
digit indicating the number of dependence vectors and a lower case letter,
whenever required, to distinguish between architectures which have the same
topology and the same number of dependence vectors but are not equivalent,
e.g., H3a and H3b. The table lists all equivalent configurations in a
single row.
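
The counts 3, 9 and 27 follow from choosing one of three direction options independently for every wire. A minimal sketch of that enumeration is shown below; the labels and the function name are illustrative assumptions, and the grouping of these raw assignments into equivalence classes (the rows of the table) is not attempted here.

```python
from itertools import product

# Three options per interconnecting wire: forward, reverse, or both.
WIRE_CHOICES = ("+d", "-d", "+/-d")

def direction_assignments(num_wires):
    """All ways of assigning a direction option to each wire."""
    return list(product(WIRE_CHOICES, repeat=num_wires))

counts = {name: len(direction_assignments(k))
          for name, k in (("linear", 1), ("rectangular", 2), ("hexagonal", 3))}
print(counts)
```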


5.3  PERIODICITY  ANALYSIS  AND  THROUGHPUT 

The occurrence of cycles (i.e., closed loops of directed arcs) in the
directed graph representing a hardware architecture provides important
information about the throughput rate of the architecture. In this
subsection we analyse this information and identify the configurations with
low throughput.

The periodicity index n of architectures has been defined in Section
4.2. It can be computed either by examining undirected loops in the graph
representing the architecture or by solving the equation

$$ q D_s = 0 \qquad (5.2) $$

for every possible row vector q with integer elements, and summing the
elements of q. The periodicity index n equals the smallest of these
sums. If no solution q exists, n is defined to be 1. Following this
technique we conclude that L1, R2 have no solution and have a unit
periodicity index, while other architectures have solutions, as follows:

(i)    L2 has q = [1 1]; hence n = 2.

(ii)   R3 has q = [1 1 0]; hence n = 2.

(iii)  R4 has q = [1 1 0 0], [0 0 1 1]; hence n = 2.

(iv)   H3a has q = [1 1 -1]; hence n = 1.

(v)    H3b has q = [1 1 1]; hence n = 3.

(vi)   H4a has q = [1 1 -1 0], [0 0 1 1], [1 1 0 1]; hence n = 1.

(vii)  H4b has q = [1 -1 0 1], [0 0 1 1]; hence n = 1.

(viii) H5 has q = [1 0 1 0 -1], [1 1 0 0 0], [0 0 1 1 0]; hence n = 1.

(ix)   H6 has q = [1 0 1 0 -1 0], [1 1 0 0 0 0], [0 0 1 1 0 0],
       [0 0 0 0 1 1]; hence n = 1.

In the sequel we shall measure the throughputs of architectures
relative to the throughput of the linear architecture L1 (a classical
pipeline). Since the time interval between two successive applications of
input equals the periodicity index, the relative throughput of a given
architecture is given by the formula

$$ \text{relative throughput} = \frac{1}{\text{periodicity index}} \qquad (5.3) $$

Thus, the relative throughput of L2, R3, R4 is 1/2 and that of H3b is
1/3.


5.4  BOUNDARY  ANALYSIS 

No assumption has been made up to this point about the shape of the
boundary of a given hardware architecture. However, since the shape of the
boundary is changed by linear transformation it has to be taken into
consideration in the process of classifying architectures. As an example
consider the 6 equivalent configurations denoted by H3a (Table 5-1). The
truncated dependence matrices of the first and third of these configurations
are related by a linear transformation, viz.

$$ \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ -1 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ -1 & 0 \\ 0 & 1 \end{bmatrix} $$

Now assume that the first configuration has a rectangular boundary, which
can be characterized by the boundary matrix

$$ B_s = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $$

consisting of all dependence vectors colinear with the boundary. The linear
transformation maps this boundary into

$$ \tilde{B}_s = B_s \begin{bmatrix} 1 & 1 \\ -1 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ -1 & 0 \end{bmatrix} $$

which characterizes a parallelogram rather than a rectangle. Thus, the
first H3a configuration with a rectangular boundary is equivalent to the
third H3a configuration with a parallelogram boundary. It is not
equivalent, however, to the third H3a configuration with a rectangular
boundary. Clearly, we need to reclassify the entries of Table 5-1 according
to both the dependence matrix and the boundary.
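
The effect of a linear transformation on a boundary can be sketched numerically. The example below assumes a rectangular boundary with edge directions [1 0] and [0 1] and the transformation with rows [1 1] and [-1 0]; the helper names are illustrative. Perpendicularity of the two edge directions is used here as a simple test for "rectangle versus parallelogram".

```python
def matmul(X, Y):
    """Plain matrix product for small lists of lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def edges_perpendicular(B):
    """True when the two boundary directions in B are orthogonal."""
    return B[0][0] * B[1][0] + B[0][1] * B[1][1] == 0

T = [[1, 1], [-1, 0]]            # assumed linear transformation
B_rect = [[1, 0], [0, 1]]        # rectangular boundary directions
B_mapped = matmul(B_rect, T)     # boundary directions after the transformation

print(B_mapped, edges_perpendicular(B_rect), edges_perpendicular(B_mapped))
```

The mapped edge directions are no longer orthogonal, which is the numerical counterpart of the rectangle turning into a parallelogram.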

We shall be concerned only with boundaries that satisfy the two
following conditions:

(i) The boundary curve is a closed convex polygon.

(ii) Each segment of the boundary curve is colinear with some
dependence vector.

Thus, the only possible directions for the segments of the boundary curve
are [1 0], [0 1] and [1 1]. Consequently, there are four possible
boundary curves (Figure 5-5): rectangle, parallelogram, lower triangle,
upper triangle. Of these, only the rectangle-shape boundary can be applied
to the linear (L) and rectangular (R) architectures. On the other hand, all
four possible boundaries can be combined with hexagonal (H) architectures.
However, since linear transformations map rectangles into parallelograms and
lower triangles into upper ones, we need only consider the combination of
each hexagonal entry of Table 5-1 with either a rectangular or a triangular
boundary.


Figure 5-5. Fundamental Boundary Curves (a. Rectangle, b. Parallelogram,
c. Lower Triangle, d. Upper Triangle)

With rectangular boundaries we need to consider matrices of the form

$$ \begin{bmatrix} D_s \\ B_s \end{bmatrix} = \begin{bmatrix} D_s \\ I \end{bmatrix} $$

Clearly

$$ \begin{bmatrix} D_s \\ I \end{bmatrix} \cdot (-1) = \begin{bmatrix} -D_s \\ -I \end{bmatrix} \sim \begin{bmatrix} -D_s \\ I \end{bmatrix} $$

which shows that the reversal of all dependence vectors does not produce a
new configuration. The 6 entries in each one of the rows H3a, H4a, H4b, H5
of Table 5-1 can, therefore, be considered as 3 pairs of conjugate
configurations. Of these, the second and third pair are still equivalent
when combined with rectangular boundaries, but the first pair is different.
Thus, the entries of Table 5-1, when combined with rectangular boundaries,
can be reclassified as in Table 5-2. This time each architecture is
specified by its matrix rather than by a pictorial description as in
Table 5-1.

Similarly, we can combine each hexagonal entry of Table 5-1 with a
lower triangular boundary. This will again produce two distinct
architectures for each one of the rows H3a, H4a, H4b, H5. However, there
is no need to do it explicitly, since the resulting configurations can
always be obtained by 'cutting' the appropriate hexagonal topology combined
with a rectangular boundary along the main diagonal. Thus, it will be
sufficient to focus in the sequel upon rectangular boundaries alone.


5.5  SUMMARY 

Systolic-array-like  architectures  have  been  classified  by  topology, 
interconnection  pattern  and  shape  of  boundary.  We  have  shown  that  there  are 
only  15  distinct  (non-equivalent)  architectures  (see  Table  5-2).  We  have 
also  shown  that  it  is  sufficient  to  consider  only  rectangular  boundaries 
which  are  of  practical  importance  in  the  process  of  VLSI  layout. 


A genealogical chart (Figure 5-6) shows which architectures are
contained in any given architecture. In particular, it shows that H6 is
the 'universal architecture' for systolic arrays, containing every possible
architecture with a smaller number of dependence vectors.


Figure  5-6.  Genealogical  Chart  for  Architectures 


SECTION  6 


CLASSIFICATION OF SPACE-TIME REPRESENTATIONS


The space-time representation of a completely regular MCN was
characterized in the previous section by the dependence matrix D. The
hardware configuration was obtained by focusing upon the spatial coordinates
of the dependence vectors, which resulted in the truncated dependence matrix
$D_s$. It was observed that the temporal coordinate of all the architectures
described in Section 5 was always equal to 1, viz.,

$$ D = \begin{bmatrix} D_s & \begin{matrix} 1 \\ \vdots \\ 1 \end{matrix} \end{bmatrix} \qquad (6.1) $$

so that the dependence matrix D can be easily reconstructed for any given
$D_s$ via (6.1). The properties of the corresponding space-time diagram can
then be deduced by analysis of the dependence matrix D.


6.1  THE  FUNDAMENTAL  SPACE-TIME  CONFIGURATIONS 

Each of the fundamental 11 architectures of Table 5-2 determines a
fundamental space-time configuration. We shall focus our attention upon the
dependence matrix alone, without considering, for the present, the shape of
the boundary surface. Thus, equivalence between the fundamental space-time
configurations is established by relating the corresponding dependence
matrices by linear transformations. A simple analysis (see Appendix F)
shows that every dependence matrix with 2 vectors can be transformed into
the equivalent (canonical) form

$$ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} $$

and every dependence matrix with 3 vectors can be transformed into the
equivalent form

$$ \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} $$

Consequently, L2 ~ R2 and R3 ~ H3a ~ H3b where the tilde (~) denotes
equivalence. For dependence matrices with more than 3 vectors it is
convenient to establish first a (nonunique) canonical equivalent, i.e., an
equivalent dependence matrix whose first three rows are the identity matrix.

Some canonical-form equivalents are listed in Table 6-1. The full list of
canonical equivalents will be discussed in later sections in conjunction
with the specification of boundary surfaces in the three-dimensional
space-time continuum.
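
For three linearly independent dependence vectors, a transformation that brings D to the identity is simply the inverse of D. The sketch below illustrates this with exact rational arithmetic. It uses the matrix with rows [1 0 1], [0 1 1], [1 1 1] (three unit-time vectors) as an example; the helper names are assumptions made for illustration and this is not the procedure of Appendix F.

```python
from fractions import Fraction

def matmul(X, Y):
    """Plain matrix product for small lists of lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def inverse3(M):
    """Inverse of a 3x3 matrix via the adjugate; raises if M is singular."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("dependence vectors are linearly dependent")
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[Fraction(x, det) for x in row] for row in adj]

D = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]
T = inverse3(D)                 # transformation with D T = identity
canonical = matmul(D, T)
print(T, canonical)
```

Two dependence matrices that can both be brought to the identity in this way are related to each other by the product of one transformation with the inverse of the other, which is the mechanism behind the equivalences stated above.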


6.2  ARCHITECTURES  WITH  LOCAL  MEMORY 

The preceding analysis was based upon the assumption that processors
transmit the results of computations to their immediate neighbors and never
store them for further use. However, many applications do involve such
storage; this is true, in particular, for adaptive system/parameter
identification algorithms that store the identified parameters in fixed
locations within the array and use the signals that flow through each
processor to time-update the locally stored parameters. In this section we
consider the architectures obtained by providing each processor with a local
memory.

Topologically, local memory means the addition of a self-loop to each
processor (Figure 6-1). The direction of each interconnecting link can
still be chosen in 3 distinct ways, as explained in Section 5.2, resulting
in 11 new architectures (Table 6-2). Two important observations have to be
made regarding this table:

(i) The number of dependence vectors is larger by one than the number
of interconnections. Thus, for instance, RM3 has 4 direction
vectors, not 3.

(ii) The length of the last dependence vector, corresponding to the
local memory, equals the temporal displacement between two
consecutive occurrences of the same processor in the space-time
configuration. Thus, in general, this displacement is 1, except
for L2, R3, R4 whose temporal displacement is 2 (corresponding to
a periodicity index of 2), and except for H3b whose temporal
displacement is 3 (corresponding to a periodicity index of 3).

Local memory can also be used to interleave computations and achieve
increased throughput with architectures whose relative throughput without
memory is less than 1. This possibility will be discussed in Section 6.4.

Analysis of equivalence between space-time configurations with local
memory reveals that:

(i) LM1, which has 2 linearly independent dependence vectors, is
equivalent to L2, R2.

(ii) RM2, which has 3 linearly independent dependence vectors, is
equivalent to R3, H3a, H3b.

(iii) HM3a, which has 4 dependence vectors, is equivalent to R4.

Figure 6-1. The Three Fundamental Topologies with Local Memory


In all three cases we can trade interconnecting links for memory, thereby
reducing the number of physical wires required to construct a realization of
the architecture and simplifying the layout problem for VLSI implementation.
Thus, for instance, the R2 architecture which requires a planar network of
processors with 4 interconnecting ports at each processor can be replaced by
LM1 which requires a linear network of processors with 2 interconnecting
ports at each processor and a local memory. Even more remarkably, the same
replacement also trades low throughput configurations for high throughput

6.3  BOUNDARY  ANALYSIS 


The relation between boundary shapes and equivalence between (planar)
architectures has been examined in Section 5.4. The combination of topology
and boundary has produced 15 distinct architectures which were summarized in
Table 5-2. Since each one of these architectures has a rectangular
boundary, the resulting space-time configuration always occupies a
rectangular prism (with the exception of low-dimensional architectures such
as L1, L2, R2 whose space-time configurations occupy 1 or 2-dimensional
subspaces).

Since linear transformations change the shape of the boundary, the
equivalence between space-time configurations, discussed in Sections 6.1 -
6.2, has to be reexamined to include the effects of boundary
transformations. It will be sufficient to carry out this analysis only for
collections of space-time configurations which have been declared as
equivalent in the preceding sections.


6.3.1 


The configurations LM1, R2 can be considered equivalent only when we
assume that a single set of inputs is applied to R2 (rather than a
time-series). In this case R2 is characterized by

while LM1 is characterized by

$$ \begin{bmatrix} D \\ B \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 0 & 1 \\ \hline 1 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix} $$

and the two are equivalent, being related by a linear transformation, viz.,


t : :]  [si 


On the other hand, LM1 and R2 are not equivalent to L2 for which

$$ \begin{bmatrix} D \\ B \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 \\ -1 & 0 & 1 \\ \hline 1 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix} $$

The D-part of this characterization can be related to the D-part of LM1,
viz.,


:i] 


where the asterisks denote entries which can be chosen arbitrarily (subject
to the nonsingularity constraint of the linear transformation). However,
when the dependence matrix is combined with the boundary matrix we obtain
a B-part which does not match the B-part of L2. When the inverse of this
transformation is applied to the dependence and boundary matrices of L2,
the resulting configuration (Figure 6-2) is equivalent to an LM1
configuration of infinite order; the finite active part of the architecture
is shifted one cell to the right every time a new input is applied. Thus,
in summary, LM1 and R2 are equivalent to each other but not to L2.


6.3.2  The  Configurations  RM2 .  R3 ■  H3a.  H3b 


The truncated boundary matrix of these configurations was chosen in
Section 5.4 as


namely, the rectangular boundary. The corresponding boundary surface in the
space-time continuum is, therefore, characterized by the boundary matrix


\[
B = \begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}
\tag{6.1}
\]


When this boundary matrix is combined with the dependence matrices of RM2,
R3, H3aα, H3aβ, H3b, equivalence is destroyed. For instance, trying to
relate H3aα to RM2 we obtain


\[
\begin{bmatrix} D \\ B \end{bmatrix}_{H3a\alpha} T =
\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} 0 & -1 & 0 \\ -1 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix} =
\begin{bmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}
\]

The resulting D-part coincides with the dependence matrix of RM2, but the
boundary surface is different. The configuration obtained above by
transforming H3aα is in fact an RM2 hardware of infinite order in which
a finite active segment shifts along the diagonal by one cell each time a
set of inputs is applied to the array. This is, in fact, precisely what
happens in systolic arrays for matrix multiplication. The configuration
H3aα (of Weiser and Davis [4]) is suited for multiplying banded matrices.
When the same problem is implemented on an RM2 configuration (of S.Y. Kung
[7]) most cells in the array are idle while a small active rectangle,
corresponding to the bandwidth of the given matrices, shifts along the main
diagonal of the array. In analogy, while multiplying two matrices with no
structure is carried out efficiently by an RM2 array, solving the same
problem on an H3a configuration involves many idle cells and a small
active segment that shifts along the main diagonal.
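
The H3aα example can be checked numerically. The sketch below is an
illustration only: it assumes that the matrices of the example read
D = [1 0 1; 0 1 1; 1 1 1] for H3aα, B as in (6.1), and
T = [0 -1 0; -1 0 0; 1 1 1], and the helper name `matmul` is hypothetical.
It multiplies the D-part and the B-part by T and compares the results with
the RM2 dependence matrix and with the rectangular boundary.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Assumed reading of the matrices in the H3a-alpha example (see lead-in).
D_H3A = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]    # dependence matrix of H3a-alpha
B_RECT = [[1, 0, 1], [0, 1, 1], [0, 0, 1]]   # rectangular boundary, eq. (6.1)
T = [[0, -1, 0], [-1, 0, 0], [1, 1, 1]]      # candidate linear transformation
D_RM2 = [[1, 0, 1], [0, 1, 1], [0, 0, 1]]    # dependence matrix of RM2

d_new = matmul(D_H3A, T)   # transformed dependence part
b_new = matmul(B_RECT, T)  # transformed boundary part

print("D-part matches RM2:", d_new == D_RM2)
print("B-part still rectangular:", b_new == B_RECT)
```

Under these assumptions the D-part maps onto RM2 while the last boundary row
becomes [1 1 1], which is the diagonal shift of the active segment described
in the text.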


6.3.3  The Configurations HM3a, R4

These configurations have the same boundary matrix, given by (6.1), as
RM2, R3, H3a and H3b. Since their dependence matrices are different, we
conclude that HM3aα, HM3aβ, R4 are distinct configurations when the shape
of the boundary surface is taken into account.

6.3.4  Summary 

When boundary considerations are taken into account, each of the 15
architectures of Table 5-2 is distinct and cannot be related by equivalence
to any other architecture in this table. Incorporating local memory results
in doubling the total number of distinct configurations to 30.

6.4  INTERLEAVING  ARCHITECTURES  BY  LOCAL  MEMORY 

The introduction of local memory in Section 6.2 involved the assumption
that locally stored data remain in memory until required, which makes
particular sense in a data-driven realization. Consequently, the duration of
storage for some architectures (L2, R3, R4, H3b) was longer than one time
unit. This fact can be used to construct new architectures with higher
throughput, by interleaving computations in time and connecting the
interleaved computations via the local memory.

The simplest example of such a construction is the architecture L2.
Without interleaving, the throughput of L2, LM2 is 1/2 (Figure 6-3a).
With interleaving, which involves superimposing in time two L2 schemes and
interconnecting them via local memory, the resulting LiM2 configuration
(Figure 6-3b) has throughput = 1. A similar approach produces the
architectures RiM3, RiM4 and HiM3b, whose characterizations are given in
Table 6-3. The difference between these architectures and their
noninterleaved counterparts is the shortening of the local memory dependence
vector from either [0 0 2] or [0 0 3] to [0 0 1].
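
The effect of interleaving on throughput can be illustrated with a small
calculation. The sketch below rests on an assumption that is not stated in
this form in the text: that the relative throughput of a noninterleaved
scheme is the reciprocal of the time component of its local memory vector
(1/2 for [0 0 2], 1/3 for [0 0 3]), and that interleaving superimposes that
many copies of the scheme. The function names are hypothetical, and the
assignment of [0 0 3] to H3b is inferred from its stated throughput of 1/3.

```python
from fractions import Fraction


def relative_throughput(memory_vector):
    """Assumed rule: one result every `duration` time units."""
    duration = memory_vector[-1]  # time component of the local-memory vector
    return Fraction(1, duration)


def interleaved_throughput(memory_vector):
    """Throughput after superimposing `duration` copies of the scheme in time."""
    copies = memory_vector[-1]
    return copies * relative_throughput(memory_vector)


l2 = relative_throughput([0, 0, 2])        # L2 without interleaving
li_m2 = interleaved_throughput([0, 0, 2])  # two interleaved L2 schemes
h3b = relative_throughput([0, 0, 3])       # H3b without interleaving (assumed vector)

print(l2, li_m2, h3b)
```

The first two values reproduce the figures quoted for L2 and LiM2; the rule
itself should be read as a working assumption rather than a result of the
report.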

TABLE 6-3. DEPENDENCE MATRICES FOR INTERLEAVED ARCHITECTURES

   LiM2          RiM3          RiM4          HiM3b

  1  0  1       1  0  1       1  0  1       1  0  1
 -1  0  1      -1  0  1      -1  0  1       0  1  1
  0  0  1       0  1  1       0  1  1      -1 -1  1
                0  0  1       0 -1  1       0  0  1
                              0  0  1

6.5  SUMMARY

Space-time  configurations  have  been  classified  by  topology, 
interconnection  pattern,  shape  of  boundary,  existence  of  local  memory  and 
interleaving.  The  15  fundamental  architectures  of  Table  5-2  give  rise  to 
another  15  configurations  involving  local  memory.  These,  in  turn,  give  rise 
to  4  interleaved  configurations,  producing  a  total  of  34  distinct  space-time 
configurations. 


Ignoring the shape of the boundary surface results in 20 distinct
configurations:

   1)  L1                               11)  H4aα, H4aβ
   2)  LM1, L2, R2                      12)  H4bα, H4bβ
   3)  LM2                              13)  RM4
   4)  LiM2                             14)  RiM4
   5)  RM2, R3, H3aα, H3aβ, H3b         15)  HM4aα, HM4aβ
   6)  RM3                              16)  HM4bα, HM4bβ
   7)  RiM3                             17)  H5α, H5β
   8)  HM3aα, HM3aβ, R4                 18)  HM5α, HM5β
   9)  HM3b                             19)  H6
  10)  HiM3b                            20)  HM6

Ignoring, in addition, the details of local memory (and, consequently, of
interleaving) results in only 8 distinct configurations, as in Table 6-1.

Choosing the optimal configuration for a given computational scheme
requires a specification of both the interconnection pattern and the
boundary shape. This can be accomplished only when specific details of the
corresponding computational scheme are taken into account (e.g., bandedness
of matrices to be multiplied). When only partial information is considered
the designer is often able to choose the interconnection pattern but not the
boundary. Thus, multiplication of two matrices can be implemented in any of
the five equivalent hardware configurations RM2 [7], R3 [10], H3aα [4],
H3aβ, H3b [5]. However, RM2 will be optimal if both matrices have no
particular structure; R3 will be optimal if only one of the matrices is
banded; and H3aα (or H3aβ) will be optimal if both matrices are banded.
It is a historical curiosity that the first systolic array for matrix
multiplication, H3b, is never optimal, because it has relative throughput
of 1/3 and is otherwise equivalent to H3a.


SECTION  7
CONCLUSIONS


A modeling and analytic methodology for parallel algorithms and
architectures has been presented. Modular computing networks (MCNs) were
introduced as a unifying concept that can be used to describe both
algorithms and architectures. The representation of an MCN exhibits all the
relevant information that characterizes a parallel processing algorithm,
from precedence relations and order of execution, through scheduling and
pipelinability considerations, to map compositions and global
characterization. It clearly displays the hierarchical structure of a
parallel system and the multiplicity of choices for hardware implementation.
Our methodology applies both to arbitrary (irregular) networks and to
iterative ones. Regularity of networks translates directly into regularity
of the model we use to describe them and, consequently, into regularity of
the associated architectures. Problems of non-executability (deadlocks,
safeness, etc.) are insignificant in our methodology and can be easily
detected and resolved.

Infinite  MCNs,  which  occur  in  most  signal  processing  applications,  have 
been  characterized.  It  has  been  shown  that  the  key  property  for 
executability  of  such  networks  is  structural  finiteness  (in  addition  to 
absence  of  cycles,  of  course).  Infinite  MCNs  are  most  frequently  iterative, 
in  which  case  they  are  guaranteed  to  be  structurally  finite  and  can  be 
represented  by  a  finite  single-layer  diagram. 

Dimensionality,  pipelinability  and  throughput  have  been  introduced  as 
fundamental  structural  attributes  of  MCN  models.  Throughput  computations 
(see,  e.g.,  [9])  have  been  established  as  a  direct  consequence  of  the  notion 
of  schedule,  which  applies  to  every  MCN  model.  The  wavefront  concept 
[4,7,8]  has  been  shown  to  be  a  natural  outcome  of  associating  schedules  with 
iterative  networks.  Systolic-array-like  architectures  were  modeled  and 
analyzed  via  the  concept  of  completely  regular  MCNs. 


A classification of canonical realizations for completely regular
modular computing networks has been presented. Three levels of abstraction
were considered: topology, architecture and space-time representation. The
analysis revealed 3 canonical topologies, 15 canonical architectures and 34
canonical space-time configurations. It was shown that the unique canonical
counterpart of any given topology, architecture or space-time configuration
is obtained via a simple (and unique) transformation of the corresponding
dependence and boundary matrices. It was also shown that only rectangular
boundaries are required to implement any canonical realization. While
ignoring boundary details allows some flexibility of design, it also results
in inefficient implementations, as explained in Section 6.5.

It is interesting to observe that only a small fraction of the
architectures described in this memo have actually been used in the design
of parallel algorithms. The most commonly encountered architectures are the
linear ones (L2, LM1), which are used for linear filtering (= convolution,
polynomial multiplication) and related computations. Next comes the
rectangular architecture RM2 and its equivalents (R3, H3a, H3b), which are
used in matrix products, matrix triangularizations, solutions of linear
equations, QR-factorizations for eigenvalue problems, and adaptive
multichannel least-squares algorithms. Thus, all applications involved, to
date, only architectures with 3 dependence vectors or less. Notice also
that the classical pipeline (L1) has no use as a signal processing
architecture.


APPENDIX A


PROOF  OF  THEOREM  2.2  FOR  INFINITE  MCNs 


If the MCN has an execution then it must be acyclic, as was pointed out
at the beginning of Section 2.3. To prove the converse we shall construct
an execution for an arbitrary acyclic, structurally-finite MCN.

First, notice that, by Theorem 2.1, the inputs of the MCN can be
numbered. Let us, therefore, denote the inputs by $\{x_i;\ 0 \le i < \infty\}$.
Next, recursively define a sequence of sets of variables $\{M_i\}$ according
to the following rule:


\[ M_0 = \{x_0\} \]

\[ M_{i+1} = \{x_{i+1}\} \cup \{\text{immediate successors of } M_i\} \]


Thus, each set contains one new input of the MCN and all the immediate
successors of the preceding set. The sets $M_i$ are clearly disjoint, and,
in view of the local-finiteness property, each $M_i$ set is finite.

Moreover, every variable of the MCN is included in some $M_i$ set, because
every variable is either a global input or a finite successor of some global
input. Thus, the cascade


is, in fact, a representation of the network as a cascade of finite
(disjoint) subnetworks. Each $M_i$ set is finite, hence has an execution
with a finite number of levels. If we replace each $M_i$ by its execution,
we obtain a refinement of the previous representation, viz.,


\[ S_{00} * S_{01} * \cdots * S_{10} * S_{11} * \cdots \]


where $\{S_{ij}\}$ are the levels corresponding to the set $M_i$. Since each
$S_{ij}$ is finite, this is clearly an execution of the global MCN.


APPENDIX B


ADMISSIBLE  ARCHITECTURES 


A composition of processors is called admissible if the following three
conditions are satisfied:

(i) There are no dangling inputs or outputs.

(ii) There are no directed cycles.

(iii) The architecture is structurally finite.

Each  of  the  processors  comprising  an  architecture  can  itself  be  a 
composition  of  more  elementary  processors.  The  hierarchical  nature  of  the 
admissibility  property  implies  the  following  result. 


Theorem B.1


An  admissible  composition  of  admissible  architectures  is  itself  an 
admissible  architecture. 


Proof:


The theorem states that the three properties making up admissibility
should be exhibited by the composite architecture, if they are exhibited by
each of the subnetworks.

(i) The composite architecture has no dangling terminals, because
every terminal is connected to some subnetwork (by admissibility
of the composition) and no subnetwork has dangling terminals (by
admissibility of the subnetworks).

(ii) The composite architecture has no cycles because neither the
subnetworks nor the composition have cycles.

(iii) Structural finiteness is made up of the three following
properties: local finiteness, finite ancestries, and
countability of connected components. Local finiteness is
inherited by the composite architecture because composition does
not change the number of inputs/outputs of processors within
each subnetwork. To prove that the finite ancestry property is
also inherited by the composite architecture it will be
sufficient to consider a single variable z. Suppose that z
belongs to some subnetwork $\mathcal{N}_j$. By the admissibility of the
composition, $\mathcal{N}_j$ has a finite number of ancestor subnetworks.
The ancestry of z is obtained by tracing back the ancestry
relation through the finite collection of subnetworks $\alpha(\mathcal{N}_j)$.
And since each subnetwork is admissible, the portion of $\alpha(z)$
within each ancestor of $\mathcal{N}_j$ is also finite, hence $\alpha(z)$ itself
is finite. Finally, an admissible composition has a countable
number of subnetworks (see Theorem 2.1) and each subnetwork has,
by assumption, a countable number of connected components.
Hence, the total number of connected components in the composite
network is countable, too.


APPENDIX C


PROOF  OF  THEOREM  2.3 
MINIMAL  EXECUTIONS  OF  FINITE  MCNs 


Every execution determines a numbering E( ) of the variables of an
MCN, viz.,

\[ x \in S_i \iff E(x) = i \]

This integer valued function satisfies the inequality (see Section 3.1)

\[ E(x) - 1 \ge \max_y \{ E(y) ;\ y \in \alpha(x) \} \tag{C.1} \]

Every finite directed acyclic graph has a unique numbering $\hat{E}(\,)$ of its
arcs (or equivalently of its vertices) that satisfies the equality

\[ \hat{E}(x) - 1 = \max_y \{ \hat{E}(y) ;\ y \in \alpha(x) \} \tag{C.2} \]


This well-known result (see, e.g., [14]) implies that every finite
executable MCN has a unique execution that satisfies (C.2). We shall call
this unique execution minimal for reasons that will become clear in the
sequel.
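
The numbering defined by (C.2) can be computed directly for a small network.
The following Python sketch is an illustration under stated assumptions: the
network is given as a dictionary `ancestors` that maps every variable to the
list of its immediate ancestors, global inputs have an empty list and are
placed at level 0, and the function name `minimal_execution` is hypothetical.

```python
def minimal_execution(ancestors):
    """Return the level E(x) of every variable of a finite acyclic network.

    Each non-input variable is placed one level after the latest of its
    immediate ancestors, which is the equality (C.2).
    """
    levels = {}

    def level(x, path=()):
        if x in path:
            raise ValueError("network has a directed cycle")
        if x not in levels:
            preds = ancestors[x]
            if preds:
                levels[x] = 1 + max(level(y, path + (x,)) for y in preds)
            else:
                levels[x] = 0  # global input
        return levels[x]

    for x in ancestors:
        level(x)
    return levels


# Hypothetical five-variable network with two global inputs.
network = {
    "x0": [],
    "x1": [],
    "u": ["x0", "x1"],
    "v": ["u", "x1"],
    "w": ["v", "x0"],
}
levels = minimal_execution(network)
print(levels)
```

In this example the variable w depends on the input x0 directly and on v,
and it is placed after v, so every variable is evaluated as soon as all of
its ancestors are available.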

Let E( ) be an arbitrary non-minimal execution. Then, there exists
some variable x for which the strict inequality

\[ E(x) - 1 > \max_y \{ E(y) ;\ y \in \alpha(x) \} \]

holds. This means that x is evaluated several steps after all its
ancestors became available. Consequently, the numbering of x can be
modified to $1 + \max_y \{ E(y) ;\ y \in \alpha(x) \}$ without violating the
precedence relation. We shall refer in the sequel to this modification as an
elementary shift.

Each execution is a series-parallel combination and consequently has a
well defined input-output map. Elementary shifts replace expressions of the
form e*e*...*p by expressions of the form p*e*...*e (see Figure C-1).

If the physically justifiable identity

\[ p * e = e * p \tag{C.3} \]

is added as an axiom of the theory of MCNs (see Appendix D), we conclude
that input-output maps remain invariant under elementary shifts. This leads
to the following result.


Theorem C.1

Every execution E( ) of a finite MCN can be transformed by a finite
number of elementary shifts into the unique minimal execution.

Proof

The minimal execution $\hat{E}(\,)$ is constructed by the following simple
algorithm (see, e.g., [14]):

(i) Put all the global inputs of the MCN in $S_0$.

(ii) For i = 0, 1, 2, ... put all the immediate successors of members
of $S_i$ in $S_{i+1}$.

Now, if E( ) is a nonminimal execution we transform it into $\hat{E}(\,)$ by the
following rule:

For i = 0, 1, 2, ... shift all members
of $S_i$ from E(x) to $\hat{E}(x) = i$.


Since the MCN is finite, a finite number of shifts will transform E( )
into $\hat{E}(\,)$. Notice that each variable is shifted exactly once. Also notice
that, by its construction, the number $\hat{E}(x)$ is equal to the length of the

longest path connecting x to some global input, since (C.2) places x one
level after the latest of its ancestors. Hence, $\hat{E}(x)$ cannot be
further reduced.


Corollary C.1.1

The minimal execution $\hat{E}(\,)$ satisfies $\hat{E}(x) \le E(x)$ for every variable
x and for every execution E( ).


Corollary C.1.2

A finite executable MCN has a unique well-defined input-output map.
This is so because all executions define the same map, by Theorem C.1.

Corollary C.1.2 establishes the theorem for finite MCNs. For infinite
networks it will be sufficient to prove that for every execution E( ) and
for every variable x the map from global inputs to x is unique and does
not depend upon the choice of execution. However, E( ) induces some
execution on the finite MCN corresponding to $\alpha(x)$, the finite ancestry of
x. Therefore, the map from inputs to x coincides, for every choice of
E( ), with the unique map determined by the minimal execution on $\alpha(x)$.

It is interesting to notice that an infinite MCN does not have, in
general, a minimal execution. The construction procedure described in the
proof of Theorem C.1 is still valid, but the levels $S_i$ are, in general,
infinite and do not determine a valid execution.


APPENDIX  D 


ELEMENTARY  EQUIVALENCE  TRANSFORMATIONS 


The general theory of MCNs does not involve any specific assumptions
about the properties of the processor maps $\{f_p\}$. Consequently, there are
only a few equivalence transformations that are still valid in this general
framework. Most equivalence transformations used with block-diagrams and
signal-flow-graphs involve linearity assumptions and do not hold for general
nonlinear maps.

Two basic maps, the identity map e and the split map s, can be used
in conjunction with any MCN manipulation. The identity map leaves its input
variables unchanged, viz.,

\[ e(x) = x \]

The split map duplicates input variables, viz.,

\[ s(x) = (x, x) \]

It is possible, of course, to have more than two copies of the same
variable, either by introducing a split processor with several outputs, or
by using several two-output split processors.

The properties of the identity and split processors give rise to
several elementary equivalence transformations (Figure D-1):

(a) The identity commutes with any other processor f.

(b) The cascade of a processor f and its inverse $f^{-1}$ can be
replaced by an identity processor, provided the processor f has
an inverse.

(c) The split processor 'commutes' with any processor f.

(d) Any processor f with two outputs can be replaced by a
composition of a split processor and two single output processors
$f_1$, $f_2$. The processors $f_1$, $f_2$ correspond to the maps from
inputs to each of the two outputs, respectively.
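
Rules (a), (b) and (d) can be tried out on ordinary functions. The Python
sketch below is an illustration only: the processor f and its component maps
f1, f2 are hypothetical examples, and `cascade` and `side_by_side` are helper
names introduced here for series and parallel composition.

```python
def identity(x):
    """The identity map e(x) = x."""
    return x


def split(x):
    """The split map s(x) = (x, x)."""
    return (x, x)


def cascade(*stages):
    """Series composition: apply the stages from left to right."""
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run


def side_by_side(g, h):
    """Parallel composition: g acts on the first copy, h on the second."""
    def run(pair):
        first, second = pair
        return (g(first), h(second))
    return run


def f(x):
    """Hypothetical processor with one input and two outputs."""
    return (x + 1, 2 * x)


def f1(x):
    """Map from the input of f to its first output."""
    return x + 1


def f2(x):
    """Map from the input of f to its second output."""
    return 2 * x


def f1_inverse(y):
    """Inverse of f1."""
    return y - 1


samples = range(-3, 4)
rule_a = all(cascade(identity, f)(x) == cascade(f, identity)(x) for x in samples)
rule_b = all(cascade(f1, f1_inverse)(x) == identity(x) for x in samples)
rule_d = all(f(x) == cascade(split, side_by_side(f1, f2))(x) for x in samples)
print(rule_a, rule_b, rule_d)
```

Checking a handful of sample inputs does not prove the identities; it only
shows how each rule reads when the processors are concrete maps.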


Figure D-1. Elementary Equivalence Transformations: (a) Commutativity of
the Identity; (b) The Inverse Processor; (c) Commutativity of the Split;
(d) Splitting of Multivariable Outputs


APPENDIX  E 


r^r jfuajhmi)  amj  jim^u  mu  *  wj  $  jjj 


The multiplication of two matrices involves the computation of inner
products between every row of one matrix and every column of the other one.
To emphasize this interpretation we shall consider in the sequel the product

\[ C = A^T B \]

so that the inner products are between columns of A and columns of B. In
fact, $c_{ij}$ is precisely the inner product between the i-th column of A
and the j-th column of B. Consequently, we can compute the product by
feeding the columns of A, B, which we denote by $a_i$, $b_j$, into the MCN of
Figure E-1. Each input is a column vector which is propagated without
modification through the network. The a, b inputs of each processor
propagate through without modification and the inner product of the two
input vectors is computed inside the processor. This multichannel
configuration can be further decomposed by observing that the inner products
can be computed recursively, i.e., if $c := a^T b$ where $a = \{\alpha_i\}$,
$b = \{\beta_i\}$ are column vectors of length N, then $c = c_N$ where

\[ c_i = c_{i-1} + \alpha_i \beta_i , \qquad c_0 = 0 \]

Thus, every single processor in Figure E-1 is, in fact, a cascade of basic
'multiply and add' processors (Figure E-2). When this decomposition is
combined with the architecture of Figure E-1, we obtain the MCN for matrix
multiplication. Figure E-3 displays a side view of this 3-D network whose
top view is shown in Figure E-1. The complete MCN consists of N
horizontal layers such as in Figure E-1 arranged in a vertical stack.
Equivalently, we may say the MCN consists of three vertical layers such as
in Figure E-3 arranged behind each other. It is important to notice that
the direction of the C-paths can be either from top to bottom, as shown in
Figure E-3, or from bottom to top. This is a consequence of the
commutativity and associativity of addition, viz.,

\[ \sum_{i=1}^{N} \alpha_i \beta_i = \sum_{i=N}^{1} \alpha_i \beta_i \]

Figure E-1. A Basic Matrix Multiplier (a. The Complete Network)


This  means  that  there  are  two  distinct  MCNs  that  correspond  to  matrix 
multiplication  and  they  differ  only  by  the  direction  of  the  C-paths. 
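
The recursive inner product and the two C-path directions can be sketched in
a few lines of Python. This is an illustration only: matrices are passed as
lists of columns, the function names are hypothetical, and integer entries
are used so that the two accumulation orders agree exactly (with
floating-point entries they could differ by rounding).

```python
def multiply_and_add(c_prev, alpha, beta):
    """One basic cell: c_i = c_{i-1} + alpha_i * beta_i."""
    return c_prev + alpha * beta


def inner_product(a, b, reverse=False):
    """Accumulate the inner product along the C-path, in either direction."""
    pairs = list(zip(a, b))
    if reverse:
        pairs.reverse()  # bottom-to-top instead of top-to-bottom
    c = 0  # c_0 = 0
    for alpha, beta in pairs:
        c = multiply_and_add(c, alpha, beta)
    return c


def matrix_product(a_cols, b_cols, reverse=False):
    """Entry (i, j) is the inner product of column i of A and column j of B."""
    return [[inner_product(a, b, reverse) for b in b_cols] for a in a_cols]


a_cols = [[1, 2], [3, 4]]  # columns of A
b_cols = [[5, 6], [7, 8]]  # columns of B
top_down = matrix_product(a_cols, b_cols)
bottom_up = matrix_product(a_cols, b_cols, reverse=True)
print(top_down, bottom_up)
```

Both directions give the same product, which is why the two MCNs compute the
same map even though their C-paths run in opposite directions.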

Every  architecture  for  matrix  multiplication  is  equivalent  to  the  MCN 
of  Figure  E-3.  The  various  architectures  are  obtained  by  imposing 
additional  constraints  upon  the  matrices  (i.e.,  bandedness)  and  rearranging 
the  resulting  reduced  MCN  as  a  space-time  diagram.  The  corresponding  self- 
timed  block-diagram  follows  immediately  from  this  rearrangement. 

The matrix multiplier of S.Y. Kung [7] is obtained by interpreting the
vertical dimension in Figure E-3 as 'time.' Since vertical arrows
correspond to local storage, the resulting block-diagram is described in
Figure E-4 (notice the similarity with E-1). The elements of each column
vector $a_i$, $b_j$ are fed sequentially into the array and each processor
has a self-loop which computes the inner product $c_{ij} = a_i^T b_j$
recursively in time.

Figure E-4. The Matrix Multiplier of S.Y. Kung (a. Self-Timed Block-Diagram)

The matrix multiplier of S. Rao [10] is designed for a banded B
matrix. It will be sufficient to analyze it for a single column of A, say
$a_i$. The MCN of Figure E-3 now has only one vertical layer, and many
processors in this layer have zero inputs and can be eliminated. The
resulting reduced MCN is shown in Figure E-5a. Dummy processors, shown in
broken line, were added to emphasize the tridiagonal nature of the MCN. A
self-timed block-diagram (Figure E-5b) is obtained by considering the
diagonal axis as 'time.' It consists of a linear array of identical
processors, one for each nonzero diagonal of the banded matrix B. The
elements of B are fed into the array by diagonals. The elements of A, C
are handled by columns: every column of A produces a row of C and
requires a linear array as in Figure E-3b. It is interesting to notice that
the input interval of this matrix multiplier is $\tau_c + \tau_a$, where
$\tau_c$ is the time required to compute 'c' and $\tau_a$ is the time
required to propagate 'a' through one processor. When the direction of the
C-path or, equivalently, of the A-path, is reversed the input interval
becomes $\tau_c - \tau_a$. Since $\tau_a \ll \tau_c$ the two networks differ
only slightly in their throughput. However, we shall presently encounter
another example where the reversal of the C-path results in a large increase
in throughput.

The matrix multiplier of H.T. Kung is designed for banded A, B
matrices. This means that the active processors in the non-reduced MCN of
Figures E-1 and E-3 are located within a parallelepiped aligned with one of
the main diagonals of the rectangular prism representing the non-reduced
MCN. A simple illustration of the reduced MCN is obtained by considering
two adjacent horizontal layers (Figure E-6). When we slide the horizontal
layers so that they overlap, the resulting network corresponds to H.T.
Kung's multiplier (Figure E-7). This network clearly has an input interval
of $\tau_c + 2\tau_a$. However, if we reverse the C-path we obtain the
configuration of Weiser and Davis [4] (Figure E-8), which has an input
interval of $|\tau_c - 2\tau_a|$. The difference between the two multipliers
is significant when they are implemented by single rate systolic arrays. In
this case $\tau_c = \tau_a = \tau$, so that the former network has an input
interval of $3\tau$ while the latter has an input interval of $\tau$!
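
The input intervals quoted in this appendix follow one pattern, which the
short Python sketch below makes explicit. It is an illustration under an
assumption: the parameter `hops` (1 for the multiplier of S. Rao, 2 for the
multipliers of H.T. Kung and of Weiser and Davis) is a name introduced here
to stand for the factor multiplying the propagation time, and the absolute
value is applied in all reversed cases.

```python
def input_interval(tau_c, tau_a, hops, reversed_c_path=False):
    """Input interval for compute time tau_c and propagation time tau_a."""
    if reversed_c_path:
        return abs(tau_c - hops * tau_a)
    return tau_c + hops * tau_a


tau = 1  # single-rate systolic array: tau_c = tau_a = tau
forward = input_interval(tau, tau, hops=2)                         # H.T. Kung
backward = input_interval(tau, tau, hops=2, reversed_c_path=True)  # Weiser and Davis
print(forward, backward, forward / backward)
```

With equal compute and propagation times the reversed C-path shortens the
input interval by a factor of three, while for a propagation time much
smaller than the compute time the two directions are nearly equivalent.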


Figure E-6. (panel labels: layer i, layer i+1)

Figure E-8. The Matrix Multiplier of Weiser (b. Top View)


APPENDIX F


EQUIVALENCE  VIA  LINEAR  TRANSFORMATIONS 


Two dependence matrices, say, $D_1$, $D_2$, are considered equivalent when
there exists a nonsingular linear transformation T and a permutation
matrix P such that

\[ D_2 = P D_1 T \tag{F.1} \]

This relation is clearly reflexive (with P = I, T = I), symmetric and
transitive, so 'equivalence' is indeed an equivalence-type relation.

Denoting the length of dependence vectors by n, and the number of
dependence vectors by p, we conclude that every dependence matrix with
$p \le n$ and full (row) rank is equivalent to

\[ D^{(p)} := \begin{bmatrix} I_p & 0 \end{bmatrix} \tag{F.2} \]


which will be defined as the canonical equivalent of such dependence
matrices. When p > n, and the dependence matrix has full (column) rank,
we can always find a permutation matrix P so that

\[ P D = \begin{bmatrix} I \\ X \end{bmatrix} T \tag{F.3} \]

where T consists of the first n rows of the permuted matrix PD. Thus,
the canonical equivalent of dependence matrices with p > n is of the form
(F.3) and the properties of D can be studied by examining the structure of
the smaller matrix X.

However, since the submatrix X in (F.3) is not unique, it is required
first to find all possible canonical equivalents to a given dependence
matrix D. This can be done by applying all possible p! permutations P
to the rows of D and then computing X via (F.3). However, not all


In summary, once all possible canonical equivalents of a given D have
been computed it is relatively easy to test whether some other dependence
matrix $\hat{D}$ is equivalent to D. One only needs to compute a single
canonical equivalent of $\hat{D}$ and compare it to the collection of canonical
equivalents of D: a match indicates that $\hat{D}$ is indeed equivalent to D.
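
The test summarized above can be sketched in Python for the case p > n. The
code is an illustration under stated assumptions: the function names are
hypothetical, exact rational arithmetic is used to avoid rounding, the
matrices in the example are made up for the purpose, and the search simply
enumerates all p! row permutations, skipping those whose first n rows are
singular.

```python
from fractions import Fraction
from itertools import permutations


def inverse(m):
    """Invert a square matrix of Fractions by Gauss-Jordan; None if singular."""
    n = len(m)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def canonical_equivalents(d):
    """All submatrices X with P D = [I; X] T, as in (F.3)."""
    n = len(d[0])
    rows = [[Fraction(v) for v in row] for row in d]
    found = set()
    for perm in permutations(rows):
        t_inv = inverse(list(perm[:n]))  # T is the first n permuted rows
        if t_inv is None:
            continue
        x = matmul(list(perm[n:]), t_inv)
        found.add(tuple(tuple(r) for r in x))
    return found


def equivalent(d_hat, d):
    """Compare one canonical equivalent of d_hat with all those of d."""
    if len(d_hat) != len(d) or len(d_hat[0]) != len(d[0]):
        return False
    candidates = canonical_equivalents(d_hat)
    return bool(candidates) and next(iter(candidates)) in canonical_equivalents(d)


# Hypothetical dependence matrices with p = 4 vectors of length n = 3.
d = [[1, 0, 1], [0, 1, 1], [1, 1, 1], [0, 0, 1]]
d_hat = [[0, 0, 1], [1, 1, 1], [1, 0, 1], [0, 1, 1]]   # rows of d, permuted
d_other = [[1, 0, 1], [0, 1, 1], [1, 1, 1], [-1, -1, 1]]

print(equivalent(d_hat, d), equivalent(d_other, d))
```

Enumerating p! permutations is practical only for the small values of p that
occur here; it is used because it mirrors the procedure described in the
text, not because it is the most economical search.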
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